Failure recovery and Checkpointing in Distributed Systems

Presentation transcript:

Slide 1: Failure recovery and Checkpointing in Distributed Systems
CS 455: Introduction to Distributed Systems
Computer Science Department, Colorado State University
Daniel Sullivan, Pasha Volchak, Tyler Decker

Slide 2: Why is this problem important?
- High-demand services need to be up 24/7 (Netflix, Facebook, Google, Amazon).
- Failures are complex because they introduce problems of state, checkpoints and rollback, data loss, loss of in-transit messages, etc.
- When one system goes down, the rest can follow.
- Failures can cost money and lives.

Slide 3: Problem characterization
- Loss of messages, in several forms: lost, delayed, orphaned, or duplicate
- Loss of state, which is very expensive
- Storage/replication of data: balancing how many copies of a file to keep available (bottlenecking, data loss, system failure)
- Checkpointing: how much vs. how little, and efficiency
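
The "how much vs. how little" question for checkpointing can be made concrete with Young's first-order approximation, which is not mentioned on the slide and is brought in here only as an illustration. It assumes a fixed checkpoint cost and failures that arrive with a known mean time between failures (MTBF); the function names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval, in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order overhead fraction: time spent checkpointing plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Assumed numbers: a checkpoint takes 50 s and the MTBF is 10,000 s.
    interval = young_interval(50.0, 10_000.0)
    print(interval, expected_overhead(interval, 50.0, 10_000.0))
```

Checkpointing more often than this estimate wastes time writing checkpoints; checkpointing less often loses more progress per failure. The model ignores restart time, so it is a rough guide rather than a tuning rule.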

Slide 4: Trade-off space for solutions in this area
- TCP vs. UDP for message passing (network bandwidth vs. certainty of delivery)
- System heartbeats and checkpointing (resource cost vs. loss of progress)
- Speed and efficiency vs. risk of losing state
- Task completion vs. self-sustainability
- Financial cost vs. new software and hardware
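
The heartbeat trade-off can be sketched as a timeout-based failure detector. This is a minimal illustration, not the presenters' design: the class name, the timeout value, and passing the current time explicitly (to keep the example deterministic) are all assumptions.

```python
class HeartbeatMonitor:
    """Suspects a node once no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` received at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.beat("a", now=0.0)
    monitor.beat("b", now=0.0)
    monitor.beat("a", now=2.5)
    print(monitor.suspected(now=4.0))
```

A shorter timeout detects failures sooner but needs more frequent heartbeats and raises the chance of suspecting a node that is merely slow, which is the resource-cost side of the trade-off.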

Slide 5: Dominant approaches to the problem (1)
- Data replication
- Scalable checkpointing systems (data-aware aggregation and compression)
- Use of cloud resources (resource brokering, scheduling)
- The election algorithm
- Leader election
- Group membership
- Self-stabilization
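
As a rough illustration of leader election, the sketch below picks the highest-ID node that still responds, which is the outcome a bully-style election converges to. The slide does not say which election algorithm was presented, so this choice is an assumption, and the message exchange between nodes is deliberately omitted; `elect_leader` and `is_alive` are hypothetical names.

```python
from typing import Callable, Iterable, Optional


def elect_leader(node_ids: Iterable[int],
                 is_alive: Callable[[int], bool]) -> Optional[int]:
    """Return the highest-ID node that responds, or None if none do."""
    for node in sorted(node_ids, reverse=True):
        if is_alive(node):  # stand-in for probing the node over the network
            return node
    return None


if __name__ == "__main__":
    failed = {4}  # assume node 4, the previous leader, has crashed
    print(elect_leader([1, 2, 3, 4], lambda n: n not in failed))
```

In a real system each node would run the election itself and the liveness check would come from a group-membership or heartbeat service.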

Slide 6: Dominant approaches to the problem (2)
- Chaos Monkey (Netflix and Stack Exchange)
- System scanning
- Learning algorithms
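
The idea behind Chaos Monkey, deliberately injecting failures to check that the system survives them, can be shown on a toy in-memory store. This sketch does not use or imitate Netflix's actual tool; `ReplicatedStore` and `kill_random_replica` are hypothetical names made up for illustration.

```python
import random


class ReplicatedStore:
    """Toy key-value store that writes every value to all live replicas."""

    def __init__(self, n_replicas: int):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.alive = [True] * n_replicas

    def put(self, key, value) -> None:
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def get(self, key):
        """Read from the first live replica that holds the key."""
        for i, replica in enumerate(self.replicas):
            if self.alive[i] and key in replica:
                return replica[key]
        raise KeyError(key)


def kill_random_replica(store: ReplicatedStore, rng: random.Random):
    """Chaos step: mark one live replica as failed; return its index, or None."""
    live = [i for i, up in enumerate(store.alive) if up]
    if not live:
        return None
    victim = rng.choice(live)
    store.alive[victim] = False
    return victim


if __name__ == "__main__":
    store = ReplicatedStore(3)
    store.put("k", "v")
    kill_random_replica(store, random.Random())
    print(store.get("k"))  # still readable while any replica survives
```

Running the chaos step repeatedly in a test shows exactly how many failures the design tolerates: with three replicas, reads survive two failures and fail on the third.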

Slide 7: Insights gleaned
- Test the limits of your system! Practicing fault tolerance = actual fault tolerance.
- Adapt solutions from other fields to your current problem:
  - compression combined with checkpointing
  - cloud resources
- Better safe than sorry (use replication!)
- Stay organized by using smart algorithms
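
Combining compression with checkpointing can be sketched with the Python standard library. This is a minimal single-process illustration, not the data-aware scheme the slides refer to; the function names and the use of `pickle` for serialization are assumptions, and `pickle` should only be used to load checkpoints the program wrote itself.

```python
import os
import pickle
import tempfile
import zlib


def save_checkpoint(state, path: str) -> None:
    """Serialize, compress, and atomically write `state` to `path`."""
    payload = zlib.compress(pickle.dumps(state))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        # Atomic rename: a crash leaves either the old or the new checkpoint.
        os.replace(tmp_path, path)
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # clean up only if the rename did not happen


def load_checkpoint(path: str):
    """Read, decompress, and deserialize a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "state.ckpt")
        save_checkpoint({"step": 42, "weights": [0.0] * 1000}, ckpt)
        print(load_checkpoint(ckpt)["step"])
```

Writing to a temporary file and renaming it means a failure in the middle of a checkpoint never corrupts the previous one, so recovery can always roll back to a complete state.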

Slide 8: What the problem space will look like in the future
- Larger and more complex systems
- Increase in global access (device count, more people)
- Technological progress
- Increase in demand:
  - Resource use
  - Data storage
  - Network bandwidth
- Increasing emphasis on security

Slide 9: Trade-off space and solutions in the future
- Chaos Monkey / limit testing: high demand for attention vs. sustainability
- Combining resources from multiple areas: cutting-edge technologies (McR Engine, compression, cloud resources); this can be expensive
- Hierarchical layer management
- Efficiency of data storage/replication

Slide 10: Questions?