Failures in the System Two major components in a node: the application and the system.
Failures in the System Nebraska runs similar systems to Google. [Diagram: the Google stack (Bigtable, GFS, file system, hard drive) alongside the Nebraska stack (cluster scheduler, Hadoop, file system, hard drive), each split into an application and a system layer.] A failure here will cause unavailability, and could cause data loss.
Unavailability: Defined Data on a node is unreachable. Detection: periodic heartbeats are missing. Correction: lasts until the node comes back or the system recreates the data.
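A minimal sketch of that detect-and-correct loop, assuming illustrative values for the heartbeat interval, the missed-heartbeat count, and the delay before re-replication; none of these names or thresholds come from the paper.

```python
import time

# Sketch of heartbeat-based unavailability detection (illustrative only;
# all thresholds below are assumptions, not the paper's implementation).
HEARTBEAT_INTERVAL = 10          # seconds between expected heartbeats (assumed)
MISSED_BEFORE_UNAVAILABLE = 3    # missed heartbeats before a node is flagged (assumed)
REREPLICATION_DELAY = 15 * 60    # wait before recreating data elsewhere (assumed)


class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}           # node -> timestamp of last heartbeat
        self.unavailable_since = {}   # node -> timestamp it was declared unavailable

    def heartbeat(self, node, now=None):
        """A node checked in: record it and clear any unavailability flag."""
        now = now if now is not None else time.time()
        self.last_seen[node] = now
        self.unavailable_since.pop(node, None)

    def check(self, now=None):
        """Detect missing heartbeats and decide when to start re-replication."""
        now = now if now is not None else time.time()
        for node, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_INTERVAL * MISSED_BEFORE_UNAVAILABLE:
                # Node is unavailable; remember when we first noticed.
                self.unavailable_since.setdefault(node, now)
                if now - self.unavailable_since[node] > REREPLICATION_DELAY:
                    # The outage has lasted too long: recreate this node's data
                    # from surviving replicas (recovery itself not shown here).
                    print(f"re-replicating chunks held by {node}")
```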
Unavailability: Measured [Graph of measured unavailability events, annotated with the point where replication starts.] Question: after replication starts, why does it take so long to recover?
Node Availability [Chart: storage vs. software restart times.] Software is fast to restart.
Node Availability: Time [Chart of downtime by cause, highlighting planned reboots.] Node updates (planned reboots) cause the most downtime.
MTTF for Components Even though a disk failure can cause data loss, node failure happens much more often. Conclusion: node failure matters more to system availability.
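A back-of-the-envelope comparison to make that point concrete; every number below (MTTF values, disks per node) is an assumption for illustration, not a figure from the paper.

```python
# Rough comparison of failure event rates per node per year.
# All numbers are illustrative assumptions.
HOURS_PER_YEAR = 24 * 365

disk_mttf_hours = 50 * HOURS_PER_YEAR   # assumed: a disk fails every ~50 years
node_mttf_hours = 1 * HOURS_PER_YEAR    # assumed: a node fails/reboots about yearly
disks_per_node = 4                      # assumed

# Expected failure events per node-year contributed by each component.
disk_events = disks_per_node * HOURS_PER_YEAR / disk_mttf_hours
node_events = HOURS_PER_YEAR / node_mttf_hours

print(f"disk failures per node-year: {disk_events:.2f}")   # ~0.08
print(f"node failures per node-year: {node_events:.2f}")   # ~1.00
# Node-level events dominate, so node failure drives availability even though
# only disk failures destroy data outright.
```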
Correlated Failures A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes. Losing more nodes before replication can start can make data unavailable.
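A small Monte Carlo sketch of why bursts matter: with random 3-way placement, the number of blocks that lose every replica grows sharply with the size of a simultaneous failure burst, so losing many nodes at once (before re-replication can restore redundancy) is far worse than losing the same nodes one at a time. Cluster size, block count, and burst sizes are illustrative assumptions.

```python
import random

# How many blocks lose every replica when a burst of node failures hits
# before re-replication can kick in. All parameters are assumed.
NODES    = 200      # nodes in the cell
BLOCKS   = 20_000   # blocks, each with 3 replicas on distinct random nodes
REPLICAS = 3
TRIALS   = 20

def blocks_lost(failed_nodes):
    """Count blocks whose replicas all land on currently-failed nodes."""
    lost = 0
    for _ in range(BLOCKS):
        placement = random.sample(range(NODES), REPLICAS)
        if all(n in failed_nodes for n in placement):
            lost += 1
    return lost

for burst in (1, 5, 20, 50):   # number of nodes failing simultaneously
    avg = sum(blocks_lost(set(random.sample(range(NODES), burst)))
              for _ in range(TRIALS)) / TRIALS
    print(f"burst of {burst:2d} nodes -> ~{avg:.1f} blocks unavailable")
# Losses grow much faster than linearly with burst size: single failures cost
# essentially nothing, while large bursts take out many blocks at once.
```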
Correlated Failures [Chart of failure bursts over time; one burst is annotated as rolling reboots of the cluster.] Oh s*!t, datacenter on fire! (Maybe not that bad.)
Coping with Failure Replication vs. encoding. 3 replicas is standard in large clusters. MTTF figures cited on the slides: 27,000 years and 27.3 M years for the two schemes.
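For intuition, a quick comparison of 3-way replication against a Reed-Solomon style (9,4) code, looking at storage overhead and the chance of losing a block when each chunk is independently unavailable with probability p. The code parameters and p are assumptions for illustration; the MTTF figures on the slide come from the paper's model, not from this calculation.

```python
from math import comb

p = 0.01  # assumed probability that any given chunk/replica is unavailable

# 3-way replication: block lost only if all 3 replicas are gone; 3x storage.
rep_overhead = 3.0
rep_loss = p ** 3

# Reed-Solomon style (9,4): 9 data chunks + 4 parity chunks; any 9 of the 13
# reconstruct the block, so it is lost only when 5 or more chunks are gone.
n_data, n_parity = 9, 4
n_total = n_data + n_parity
rs_overhead = n_total / n_data
rs_loss = sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
              for k in range(n_parity + 1, n_total + 1))

print(f"replication: {rep_overhead:.2f}x storage, loss prob {rep_loss:.2e}")
print(f"RS(9,4):     {rs_overhead:.2f}x storage, loss prob {rs_loss:.2e}")
# Under independent chunk failures the code gives both less storage overhead
# (~1.44x vs 3x) and a lower loss probability -- which is what makes the
# presenter's closing question ("why isn't it used everywhere?") interesting.
```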
Coping with Failure Cell replication (datacenter replication). [Diagram: Block A is replicated across cells, with a copy in Cell 1 and a copy in Cell 2.]
Modeling Failures We've seen the data; now let's model the behavior. A chunk of data can be in one of several states. With replication = 3, the chunk has 3, 2, 1, or 0 available replicas: losing a replica moves it down a state (with 2 replicas it is still available), 0 replicas means the data is unavailable, and recovery moves it back up. Each loss of a replica has a known probability, and the recovery rate is also known.
Markov Model ρ = recovery rate, λ = failure rate, s = number of block replicas, r = minimum replication.
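A sketch of how such a birth-death Markov model can be solved numerically for the mean time to data loss, using the slide's symbols (ρ as rho, λ as lam, s replicas, minimum replication r). The example rates are assumptions, and treating the failure rate as k·λ for a state with k live replicas is a modeling choice for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def mttf(s, r, lam, rho):
    """Mean time until the chain drops below r live replicas, starting from s.

    States k = r..s are transient; a failure from state r is absorption (loss).
    From state k, failures occur at rate k*lam and recovery (k -> k+1) at rho.
    Solves the standard first-passage equations q_k*t_k - sum(rate*t_j) = 1.
    """
    states = list(range(r, s + 1))
    n = len(states)
    A = np.zeros((n, n))
    b = np.ones(n)
    for idx, k in enumerate(states):
        out_rate = k * lam + (rho if k < s else 0.0)
        A[idx, idx] = out_rate
        if k < s:                    # recovery: k -> k+1
            A[idx, idx + 1] -= rho
        if k > r:                    # failure to a still-safe state: k -> k-1
            A[idx, idx - 1] -= k * lam
        # a failure from k == r is absorption (data loss), so no term is added
    t = np.linalg.solve(A, b)        # expected hitting times for each state
    return t[-1]                     # starting with all s replicas

# Example with assumed rates: 3 replicas, loss below 1 replica,
# per-replica failure about once a year, recovery in about a day.
lam = 1 / (365 * 24)   # failures per hour per replica (assumed)
rho = 1 / 24           # recoveries per hour (assumed)
print(f"MTTF ≈ {mttf(s=3, r=1, lam=lam, rho=rho) / (24 * 365):,.0f} years")
```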
Modeling Failures Using Markov models, we can find the expected time to failure. Nebraska: 402 years.
Modeling Failures For Multi-Cell Implementations
Paper Conclusions Given the enormous amount of data from Google, the paper can say: failures are typically short; node failures happen in bursts and are not independent; in modern distributed file systems, a disk failure is effectively a node failure. The authors built a Markov model of failures that accurately reasons about past and future availability.
My Conclusions This paper contributed greatly by showing data from very large scale distributed file systems. If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook? Complicated code? Complicated administration?