Reliability in Tree-based Overlay Networks
Dorian C. Arnold, University of Wisconsin
Paradyn/Condor Week, March 14-18, 2005, Madison, WI
© 2005 Dorian C. Arnold

Slide 2: Preview
- Focus on tree-based overlay networks (T-BŌN); leverage characteristics of hierarchical topologies
- MRNet overview
- Reliability background
- Our approach to T-BŌN reliability: a main-memory implicit checkpointing protocol

Slide 3: Research Domain
- Target distributed system monitors, tools, profilers, and debuggers (Paradyn, Tau, etc.)
- Fault model: crash-stop failures
- TCP-like reliability for multicast and stateful reduction operations
- Tolerate all internal node failures; graceful degradation to a flat topology

Slide 4: This Year in HPC
Processor statistics from the Top500 list:
- 7,974: top-ten average
- 18%: ≥ …
- …%: clusters
- 8,192: largest cluster
- 32,768: largest system
In 2005: a 65,536-processor system. Clusters and MPPs at these processor counts will soon be commonplace.

Slide 5: Large-Scale Challenge #1: Performance
MRNet (Multicast/Reduction Overlay Network): a T-BŌN for scalable, efficient group communications and data analyses, providing:
- Scalable multicast
- Scalable reduction
- In-network data aggregation

Slide 6: MRNet Example: Running Average Filter
[Diagram: a tree with back-ends (BE) at the leaves and a front-end at the root. Each node carries a (count, average) state; leaf states such as (1,18) and (1,8) are merged by internal nodes into partial results such as (2,13) on the way up to the front-end.]
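The merging the diagram illustrates can be sketched in a few lines. The following is a minimal, hypothetical model of the running-average state as a (count, average) pair; it is not MRNet's actual filter API (which is C++), only an illustration of why such state is mergeable at any internal node:

```python
def merge(a, b):
    """Merge two (count, average) partial states into one.

    The merged average weights each side by its count, so internal
    tree nodes can combine children's states in any order.
    """
    ca, aa = a
    cb, ab = b
    count = ca + cb
    return (count, (ca * aa + cb * ab) / count)

# Two back-ends report one sample each, 18 and 8:
print(merge((1, 18), (1, 8)))  # (2, 13.0), matching the diagram
```

Because `merge` is associative and commutative, the front-end gets the same answer regardless of the tree's shape, which is what makes in-network aggregation safe.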

Slide 7: Large-Scale Challenge #2: Reliability
Assuming independent node failures, a system's failure rate grows linearly with its size: a system with 10,000 nodes is 10² (100) times more likely to fail than one with 100 nodes.

Slide 8: Large-Scale Challenge #2: Reliability
Leverage characteristics of T-BŌNs to provide highly scalable reliability protocols:
- Logarithmic properties
- Regularity and predictability of structure and communication
- Inherent data redundancy

Slide 9: Approaches to Distributed Reliability
- Reliable group communications
- Distributed transactions
- Rollback-recovery protocols

Slide 10: Rollback-Recovery Protocols
Checkpoint/restart. Challenges:
- Time overhead
  - Checkpointing latency: commit latency (stable-storage access); coordination (coordinated checkpointing)
  - Recovery latency: calculating the recovery point (uncoordinated checkpointing)
- Space overhead
  - Checkpoint storage: multiple/useless checkpoints (uncoordinated checkpointing); forced checkpoints (communication-induced checkpointing)
  - Protocol messages
- Complexity
  - Heterogeneity
  - Recovery semantics

Slide 11: Approach to T-BŌN Reliability
- A framework for studying various recovery protocols in T-BŌNs: specify different recovery protocols for experimentation or customization; cost-benefit analyses of various recovery schemes
- Three new rollback-recovery protocols:
  1. Main-memory implicit checkpoints (MMIC) and state regeneration
  2. Uncoordinated checkpoints with fast recovery
  3. Pure communication-induced checkpoints

Slide 12: MMIC Idea
- Leverage the inherent redundancy of stateful reduction networks
- Eliminate explicit checkpoints: use volatile storage, which reduces checkpoint latency; checkpointed state is used to regenerate the state of other failed processes
- Establish a recovery clique to enable efficient recovery

Slide 13: Filter State Operations
- Input: set of states from a complete set of sibling nodes. Output: regenerated state of the parent node.
- Input: states from a parent node and an incomplete set of children nodes. Output: regenerated state of the failed node(s).

Slide 14: Filter State Operations (cont'd)
- Input: state from one node. Output: two states to be assumed by two new sibling nodes jointly responsible for the task of the original node.
- Input: two states from nodes in the network. Output: state to be assumed by a new node responsible for the tasks of the two original nodes.
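For a running-average filter, each of these state operations has a simple closed form. A sketch follows, assuming state is a (count, average) pair; the function names `compose`, `decompose`, and `split` are illustrative stand-ins, not the presentation's or MRNet's actual identifiers:

```python
def compose(children):
    """Regenerate a parent's state from a complete set of child states."""
    count = sum(c for c, _ in children)
    total = sum(c * avg for c, avg in children)
    return (count, total / count)

def decompose(parent, known):
    """Regenerate a single missing child from the parent's state and
    the states of the surviving children (subtract their contributions)."""
    count = parent[0] - sum(c for c, _ in known)
    total = parent[0] * parent[1] - sum(c * avg for c, avg in known)
    return (count, total / count)

def split(state):
    """Split one state into two halves for two new sibling nodes;
    the average is preserved and the count is divided."""
    count, avg = state
    return (count // 2, avg), (count - count // 2, avg)

# Parent regenerated from a complete set of siblings:
print(compose([(2, 8), (2, 27), (2, 16), (2, 5)]))        # (8, 14.0)
# A failed child regenerated from the parent and surviving siblings:
print(decompose((8, 14), [(2, 8), (2, 16), (2, 5)]))      # (2, 27.0)
```

Note that `decompose` only works because the parent's aggregate implicitly contains every child's contribution, which is exactly the "inherent data redundancy" MMIC exploits.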

Slide 15: MMIC: Recovery Semantics
1. Detect failure
2. Establish a recovery clique: the set of processes whose persistent state can be used to regenerate that of the failed node
3. Identify a take-over node, which assumes the role of the failed node
4. Regenerate the persistent state of the failed node
5. Reintegrate the regenerated state into the take-over node
6. Resume

Slide 16: MMIC Example: Running Average Filter
[Diagram: a parent with state S_p: (8,14) and four children, with back-ends (BE) below, holding states S_c0: (2,8), S_c1: (2,27), S_c2: (2,16), S_c3: (2,5).]

Slide 17: MMIC Example: Running Average Filter
[Same tree: S_p: (8,14); children S_c0: (2,8), S_c1: (2,27), S_c2: (2,16), S_c3: (2,5).]
1. Detect failure

Slide 18: MMIC Example: Running Average Filter
1. Detect failure
2. Calculate recovery clique

Slide 19: MMIC Example: Running Average Filter
1. Detect failure
2. Calculate recovery clique
3. Assign a take-over node

Slide 20: MMIC Example: Running Average Filter
1. Detect failure
2. Calculate recovery clique
3. Assign a take-over node
4. Regenerate lost state into the take-over node:
   4.1 read(S_p, S_c0, S_c3)
   4.2 decompose(S_p, S_c0, S_c2, S_c3) → S_c1'
   4.3 merge(S_c1', S_c2) → S_c2'
   4.4 write(S_c2') → S_c2
[Diagram: node c1, with state S_c1: (2,27), has failed; c2 is the take-over node. Regenerated state S_c1': (2,27); merged state S_c2': (4,21).]

Slide 21: MMIC Example: Running Average Filter
1. Detect failure
2. Calculate recovery clique
3. Assign a take-over node
4. Regenerate lost state into the take-over node:
   4.1 read(S_p, S_c0, S_c3)
   4.2 decompose(S_p, S_c0, S_c2, S_c3) → S_c1'
   4.3 merge(S_c1', S_c2) → S_c2'
   4.4 write(S_c2') → S_c2
5. Update and resume
[Diagram: after recovery, S_p: (8,14); surviving children S_c0: (2,8), S_c2: (4,21), S_c3: (2,5).]
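The regeneration in steps 4.1–4.4 can be replayed numerically. A sketch using hypothetical `decompose`/`merge` helpers over (count, average) states; the slide reports S_c2' as (4,21), which matches this arithmetic under integer truncation (the exact merged average is 21.5):

```python
def decompose(parent, siblings):
    # Subtract the surviving siblings' contributions from the parent's
    # aggregate to recover the failed child's (count, average) state.
    count = parent[0] - sum(c for c, _ in siblings)
    total = parent[0] * parent[1] - sum(c * a for c, a in siblings)
    return (count, total / count)

def merge(a, b):
    # Combine the regenerated state with the take-over node's own state.
    count = a[0] + b[0]
    return (count, (a[0] * a[1] + b[0] * b[1]) / count)

Sp, Sc0, Sc2, Sc3 = (8, 14), (2, 8), (2, 16), (2, 5)
Sc1_regen = decompose(Sp, [Sc0, Sc2, Sc3])  # 4.2: S_c1' = (2, 27.0)
Sc2_new = merge(Sc1_regen, Sc2)             # 4.3: S_c2' = (4, 21.5)
print(Sc1_regen, Sc2_new)
```

No stable-storage checkpoint of S_c1 is ever read: its state is reconstructed entirely from in-memory state held by the parent and surviving siblings, which is the point of the MMIC protocol.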

Slide 22: Outstanding Issues and Other Research
- Evaluation of the new rollback-recovery protocols
- Preemptive vs. non-preemptive recovery
- Failure-zone identification
- Non-trivial filters
- Failure detection
- Topology reconfiguration
- Modeling
- Transmission-layer reliability
- Efficient data-loss repair

Slide 23: References
- Roth, Arnold, and Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools," in SC2003.
- Roth, Arnold, and Miller, "Benchmarking the MRNet Distributed Tool Infrastructure: Lessons Learned," in the 2004 High-Performance Grid Computing Workshop.
More to come … see you next year!

Slide 24: Filter State Operations (cont'd)
- Input: none. Output: current state of the filter object.
- Input: state of a filter object. Output: none. Side effect: checkpoint to volatile/stable storage.