Automatic Fault Tolerance in ProActive. Christian Delbé, OASIS Team, INRIA -- CNRS -- I3S -- Univ. of Nice Sophia-Antipolis, November 29, 2006.

Presentation transcript:

Slide 2: Fault Tolerance
A system is said to be fault tolerant if it can continue operating properly when some of its parts fail.
New requirements for Grid computing:
- Large scale
- High failure rate
  - Simultaneous failures
- Heterogeneous software
  - Portability
- Heterogeneous hardware
  - Different dependability characteristics in each group

Slide 3: Fault Tolerance in Java
Rollback-recovery approach:
- Each process periodically takes a checkpoint
- Relies on the availability of stable storage
- Checkpoints are used to restore the application to a correct state
But Java threads are not checkpointable!
Could checkpointability be provided with specific tools?
- At the system level, the virtual machine level, or the compiler level
Unfortunately:
- Loss of portability or efficiency
- Unique and non-standard implementation

Slide 4: Fault Tolerance in ProActive
- New Communication-Induced Checkpointing protocol (CIC)
- Pessimistic Message-Logging protocol (PML)
- Non-intrusive
  - 100% standard Java, based on serialization
- Transparent for the programmer
  - Fault tolerance settings are given in deployment descriptors
- Based on a Fault Tolerance Server
  - Checkpoint storage
  - Failure detection
  - Resource service (deployed nodes or P2P infrastructure)
  - Localization service
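
To illustrate what "100% standard Java, based on serialization" can mean in practice, here is a minimal sketch of checkpointing and restoring an object's state with plain Java serialization. It is not ProActive code: the class names WorkerState and SerializationCheckpoint are hypothetical, and the byte array stands in for the checkpoint storage that the slide attributes to the Fault Tolerance Server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical application state; any Serializable object graph would do.
class WorkerState implements Serializable {
    private static final long serialVersionUID = 1L;
    int iteration;
    double[] values;

    WorkerState(int iteration, double[] values) {
        this.iteration = iteration;
        this.values = values;
    }
}

class SerializationCheckpoint {
    // Serialize the state to bytes. A real system would send these bytes
    // to stable storage; here the caller simply keeps them (assumption).
    static byte[] checkpoint(WorkerState state) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        }
        return bytes.toByteArray();
    }

    // Rebuild an independent copy of the state from a saved checkpoint.
    static WorkerState restore(byte[] saved) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
            return (WorkerState) in.readObject();
        }
    }

    // Checkpoint at iteration 3, keep computing, then roll back.
    static int demo() {
        try {
            WorkerState state = new WorkerState(3, new double[] {1.0, 2.0});
            byte[] saved = checkpoint(state);
            state.iteration = 7;      // progress made after the checkpoint
            state.values[0] = 99.0;
            WorkerState recovered = restore(saved);
            return recovered.iteration;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Only the object state is saved, not a thread's execution stack, which is consistent with the earlier remark that Java threads themselves are not checkpointable.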

Slides 5-6: CIC Protocol Overview
- Creation of a consistent global snapshot
- Non-blocking synchronization: low failure-free overhead
- After a failure, the entire system restarts
- Recovery time increases with system size
[Diagram: processes p1, p2, p3, p4]
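
The slides do not detail how the CIC protocol builds its consistent snapshot, so the following is only a sketch of a generic index-based communication-induced checkpointing scheme, not ProActive's actual protocol. Each message carries the sender's checkpoint index; a receiver that is behind takes a forced checkpoint before delivering the message, so checkpoints with the same index form a consistent recovery line. All class names are hypothetical, and in-transit messages are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// A message with the sender's checkpoint index piggybacked on it.
class CicMessage {
    final int payload;
    final int senderIndex;

    CicMessage(int payload, int senderIndex) {
        this.payload = payload;
        this.senderIndex = senderIndex;
    }
}

class CicProcess {
    int state;
    // checkpoints.get(i) is the state saved at checkpoint index i.
    final List<Integer> checkpoints = new ArrayList<>();

    CicProcess(int initialState) {
        state = initialState;
        checkpoints.add(state); // checkpoint 0: the initial state
    }

    int index() {
        return checkpoints.size() - 1;
    }

    void takeCheckpoint() {
        checkpoints.add(state);
    }

    CicMessage send(int payload) {
        return new CicMessage(payload, index());
    }

    void receive(CicMessage m) {
        // Forced checkpoint: catch up with the sender BEFORE delivery,
        // without blocking the sender or any other process.
        while (index() < m.senderIndex) {
            takeCheckpoint();
        }
        state += m.payload;
    }

    void rollbackTo(int idx) {
        state = checkpoints.get(idx);
        checkpoints.subList(idx + 1, checkpoints.size()).clear();
    }
}

class CicSketch {
    static int[] demo() {
        CicProcess p1 = new CicProcess(5);
        CicProcess p2 = new CicProcess(2);
        p1.takeCheckpoint();              // p1 reaches index 1
        p2.receive(p1.send(10));          // p2 is forced to index 1 first
        int forcedIndex = p2.index();
        p1.state = 6;                     // more local work on p1
        // Failure: every process restarts from the common recovery line.
        int line = Math.min(p1.index(), p2.index());
        p1.rollbackTo(line);
        p2.rollbackTo(line);
        return new int[] {forcedIndex, p1.state, p2.state};
    }
}
```

After the rollback neither process reflects the message, which is why the snapshot is consistent, and every process has to restart, which matches the recovery cost stated on the slide.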

Slides 7-8: PML Protocol Overview
- Independent checkpoints
- All messages must be logged
- Failure-free overhead increases with message rate
- After a failure, only the faulty process restarts
- Recovery time is independent of system size
[Diagram: processes p1, p2, p3, p4 and message m1]
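
As a rough illustration of pessimistic message logging, the sketch below logs each message before delivering it, then recovers a crashed process alone by restoring its last checkpoint and replaying the log in order. It is not ProActive code: the class names are hypothetical, logging on the receiver side is an assumption, and in-memory fields stand in for stable storage. Replay only reproduces the lost state if message handling is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

class PmlProcess {
    int state;                  // volatile state, lost on a crash
    int checkpointedState;      // stand-in for stable storage: last checkpoint
    final List<Integer> messageLog = new ArrayList<>(); // stand-in for the stable log

    void receive(int payload) {
        messageLog.add(payload); // pessimistic: log BEFORE delivery
        deliver(payload);
    }

    // Order-sensitive update, so replay order matters.
    private void deliver(int payload) {
        state = state * 2 + payload;
    }

    void takeCheckpoint() {
        checkpointedState = state;
        messageLog.clear();      // older messages are covered by the checkpoint
    }

    void crash() {
        state = -1;              // volatile state is gone
    }

    void recover() {
        state = checkpointedState;
        for (int payload : messageLog) {
            deliver(payload);    // replay in the original delivery order
        }
    }
}

class PmlSketch {
    static int demo() {
        PmlProcess p = new PmlProcess();
        p.receive(1);            // state = 1
        p.takeCheckpoint();
        p.receive(3);            // state = 5
        p.receive(4);            // state = 14
        p.crash();
        p.recover();             // restore 1, replay 3 and 4
        return p.state;
    }
}
```

The log write on every receive is the failure-free cost that grows with message rate, while recovery touches only the failed process.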

Slide 9: Performance Comparison, CIC vs PML
- Jacobi iteration (SPMD iterative reduction of a matrix) on matrices of two sizes (sizes missing from the transcript)
- As system size increases, checkpoint size decreases and message rate increases

Slide 10: Mixing CIC and PML
Based on recovery groups:
- Independent groups linked with PML
- After a failure, only the affected group has to restart
- Fault Tolerance Servers are independent
- Groups are created dynamically on a common stable server
[Diagram labels: CIC, PML]
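
The effect of recovery groups on the restart set can be sketched as follows. This is only an illustration under assumed names: the RecoveryGroups class, the process names, and the group labels A and B are hypothetical, and the group assignment is hard-coded rather than created dynamically.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class RecoveryGroups {
    // Return the processes that must restart when `failed` crashes:
    // every member of the failed process's group, and nobody else.
    static Set<String> processesToRestart(Map<String, String> groupOf, String failed) {
        String group = groupOf.get(failed);
        Set<String> restart = new TreeSet<>();
        for (Map.Entry<String, String> entry : groupOf.entrySet()) {
            if (entry.getValue().equals(group)) {
                restart.add(entry.getKey());
            }
        }
        return restart;
    }

    static String demo() {
        Map<String, String> groupOf = new TreeMap<>();
        groupOf.put("p1", "A");
        groupOf.put("p2", "A");
        groupOf.put("p3", "B");
        groupOf.put("p4", "B");
        // p3 fails: only group B rolls back; group A keeps running.
        return processesToRestart(groupOf, "p3").toString();
    }
}
```

With a single group containing every process this reduces to the whole system restarting, and with one process per group it reduces to only the faulty process restarting.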

Slide 11: Rollback on Grid requirements
- Large scale
  + Divide-and-conquer approach
- High failure rate
  + Failure impact limited to the group
  + Can handle multiple failures
- Heterogeneous software
  + Only standard Java
- Heterogeneous hardware
  + Can apply the best-suited settings in each group

Slide 12: Performance Comparison, CIC vs Mixed
- Jacobi iteration on matrices of two sizes (sizes missing from the transcript)
- Two groups mapped on two clusters of Grid5000

Slide 13: Performance Comparison, CIC vs Mixed
- Jacobi iteration on matrices of two sizes (sizes missing from the transcript)
- Two groups mapped on two clusters of Grid nodes

Slide 14: Automatic and Transparent Fault Tolerance
- Easy to use
  - Configured at deployment time
- Three protocols; the choice depends on hardware and application properties
  - CIC
  - PML
  - Mixed
- The next release, 3.2, will include the Mixed protocol
[Diagram: "Fault Tolerance in ProActive", with axes Failure Frequency (- to +) and Communication Rate (- to +)]

Slide 15: Performance of the Mixed Protocol
- Jacobi iteration on a matrix
- Groups mapped on 4 to 6 clusters of Grid5000

Slides 16-17: CIC Performance Evaluation
- Jacobi iteration (SPMD iterative reduction of a matrix)
- CG NAS Parallel Benchmark (Conjugate Gradient)