Failure Recovery and Checkpointing in Distributed Systems

CS 455: Introduction to Distributed Systems
Computer Science Department, Colorado State University
Daniel Sullivan, Pasha Volchak, Tyler Decker

Why is this problem important?
- High-demand services need to be up 24/7 (Netflix, Facebook, Google, Amazon)
- Failures are complex: they introduce problems of state, checkpoints and rollback, data loss, lost messages, etc.
- When one system goes down, the rest can follow
- Failures can cost money and lives

Problem characterization
- Message failures come in several types
  - Lost, delayed, orphaned, or duplicate
- Loss of state = very expensive
- Storage/replication of data
  - Find the balance in keeping a file in good supply (bottlenecking, data loss, system failure)
- Checkpointing (how much vs. how little, and efficiency)
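The duplicate and delayed cases above can be handled on the receiving side with sequence numbers. The sketch below is a minimal illustration, not a method from the slides: the `Receiver` class and its method names are hypothetical, and it assumes the sender numbers messages 0, 1, 2, ... It does not address orphaned messages, which need coordination with checkpoints.

```python
class Receiver:
    """Delivers each message once, in order, despite duplicates and delays.

    Hypothetical helper for illustration; assumes the sender numbers
    messages 0, 1, 2, ... with no gaps.
    """

    def __init__(self):
        self.next_seq = 0    # next sequence number we expect to deliver
        self.pending = {}    # early (out-of-order) messages, keyed by seq
        self.delivered = []  # payloads handed to the application

    def on_message(self, seq, payload):
        # Drop duplicates: already delivered, or already buffered.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver every message that is now contiguous.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self, highest_seen):
        """Sequence numbers the sender may need to retransmit."""
        return [s for s in range(self.next_seq, highest_seen + 1)
                if s not in self.pending]


receiver = Receiver()
for seq, payload in [(0, "a"), (2, "c"), (0, "a")]:  # 1 is delayed, 0 repeats
    receiver.on_message(seq, payload)
print(receiver.delivered, receiver.missing(2))
```

A gap reported by `missing` only says a message has not arrived yet; whether it is lost or merely delayed is something the receiver cannot tell without a timeout.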

Trade-off space for solutions in this area
- TCP vs UDP for message passing (network bandwidth vs certainty)
- System heartbeats and checkpointing (resource cost, loss of progress)
- Speed and efficiency vs risk of losing state
- Task completion vs self-sustainability
- Financial cost vs new software and hardware
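The heartbeat trade-off can be made concrete with a timeout-based failure detector. This is a minimal sketch under stated assumptions: `HeartbeatMonitor` is a hypothetical name, and timestamps are passed in explicitly (rather than read from a clock) so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Suspects a node once it has been silent for longer than `timeout`.

    Hypothetical helper for illustration. Times are plain numbers
    (seconds) supplied by the caller.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its latest heartbeat

    def heartbeat(self, node, now):
        # Record that `node` was alive at time `now`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("a", now=0.0)
monitor.heartbeat("b", now=0.0)
monitor.heartbeat("a", now=4.0)    # "b" stays silent
print(monitor.suspected(now=5.0))  # only "b" has exceeded the timeout
```

A shorter timeout detects failures sooner but needs more frequent heartbeats and suspects slow nodes more often; a longer one saves bandwidth at the cost of slower detection.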

Dominant approaches to the problem (1)
- Data replication
- Scalable checkpointing systems (data-aware aggregation and compression)
- Use of cloud resources (resource brokering, scheduling)
- The election algorithm
- Leader election
- Group membership
- Self-stabilization
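As one illustration of leader election, the sketch below picks the live node with the highest ID, which is the outcome a bully-style election converges to. It is an assumption-laden simplification: it shows only the result, not the message exchange, and `elect_leader` and `is_alive` are hypothetical names standing in for whatever group-membership view a real system has.

```python
def elect_leader(node_ids, is_alive):
    """Return the highest-ID node that is currently alive.

    `is_alive` is a caller-supplied function (node_id -> bool); in a
    real system it would come from a failure detector or membership view.
    """
    live = [node for node in node_ids if is_alive(node)]
    if not live:
        raise RuntimeError("no live nodes to elect")
    return max(live)


status = {1: True, 2: True, 3: False}  # node 3 (the old leader) has failed
print(elect_leader(status.keys(), lambda node: status[node]))
```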

Dominant approaches to the problem (2)
- Chaos Monkey
- Netflix and Stack Exchange
- System scanning
- Learning algorithms
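The idea behind Chaos Monkey, deliberately killing parts of a running system to check that it survives, can be sketched in a few lines. This is a toy in-process model, not Netflix's tool or its API: `Replica`, `Service`, and `unleash_chaos` are hypothetical names invented for the example.

```python
import random


class Replica:
    """One copy of a service; raises ConnectionError once killed."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Fails over to the next replica when one is down."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # try the next replica
        raise RuntimeError("all replicas are down")


def unleash_chaos(service, rng, kills=1):
    """Kill up to `kills` randomly chosen live replicas; return their names."""
    live = [r for r in service.replicas if r.alive]
    victims = rng.sample(live, min(kills, len(live)))
    for victim in victims:
        victim.alive = False
    return [v.name for v in victims]


service = Service([Replica("r1"), Replica("r2"), Replica("r3")])
unleash_chaos(service, random.Random(), kills=2)
print(service.handle("ping"))  # one replica survives and still answers
```

With three replicas, the service tolerates two injected failures; a third makes `handle` raise, which is exactly the limit such a test is meant to expose.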

Insights Gleaned
- Test the limits of your system!
  - Practicing fault tolerance = actual fault tolerance
- Adapt resources from solutions in other fields to your current problem
  - Use of compression combined with checkpointing and cloud resources
- Better safe than sorry (use replication!)
- Be organized with use of smart algorithms
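Combining compression with checkpointing can be sketched as follows. This is a minimal local-file illustration, assuming the state is JSON-serializable; `save_checkpoint` and `load_checkpoint` are hypothetical names, and a real deployment would write to replicated or cloud storage rather than a single disk.

```python
import json
import os
import tempfile
import zlib


def save_checkpoint(state, path):
    """Write `state` as compressed JSON; return the compressed size in bytes.

    The data goes to a temporary file first and is then renamed over
    `path`, so a crash mid-write leaves the previous checkpoint intact.
    """
    blob = zlib.compress(json.dumps(state).encode("utf-8"))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)
    return len(blob)


def load_checkpoint(path):
    """Roll back: read and decompress the last saved state."""
    with open(path, "rb") as f:
        return json.loads(zlib.decompress(f.read()).decode("utf-8"))


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.ckpt")
    state = {"step": 3, "values": [0] * 1000}
    save_checkpoint(state, checkpoint)
    state["step"] = 99                     # progress made after the checkpoint
    state = load_checkpoint(checkpoint)    # simulated failure, then rollback
    print(state["step"])
```

Work done after the last checkpoint is lost on rollback, which is the "loss of progress" cost; checkpointing more often shrinks that loss but spends more time and storage on saving state.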

What the problem space in the future would look like
- Greater complexity and size of systems
- Increase in global access (device count, more people)
- Technological progress
- Increase in demand
  - Resource use
  - Data storage
  - Network bandwidth
- Increasing emphasis on security

Trade-off space and solutions in the future
- Chaos Monkey/limit testing
  - High demand of attention vs sustainability
- Combining resources from multiple areas
  - Cutting-edge technologies (McR Engine, compression, cloud resources)
  - This can be expensive
- Hierarchical layer management
- Efficiency of data storage/replication

Questions?
