Failures in the System
Two major components in a node: the Application and the System.
Google's stack: Bigtable → GFS → File System → Hard Drive. Nebraska runs similar systems: Cluster Scheduler → Hadoop → File System → Hard Drive.
A failure anywhere in this stack will cause unavailability, and could cause data loss.
Unavailability: Defined
Data on a node is unreachable.
Detection: periodic heartbeats are missing.
Correction: lasts until the node comes back, or until the system recreates the data.
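The detection step above can be sketched with a minimal heartbeat monitor. The class name, API, and timeout value here are illustrative assumptions, not the paper's implementation.

```python
class HeartbeatMonitor:
    """Illustrative sketch: a node is considered unavailable when its
    periodic heartbeat has not been seen within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout   # hypothetical detection threshold
        self.last_seen = {}      # node -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def unavailable(self, now):
        # Nodes whose heartbeats have been missing longer than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

mon = HeartbeatMonitor(timeout=3)
mon.heartbeat("a", now=0)
mon.heartbeat("b", now=0)
mon.heartbeat("a", now=2)
print(mon.unavailable(now=4))  # → ['b']
```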
Unavailability: Measured
Replication starts.
Question: after replication starts, why does it take so long to recover?
Node Availability
Storage software restart: software is fast to restart.
Node Availability: Time
Node updates (planned reboots) cause the most downtime.
MTTF for Components
Even though disk failure can cause data loss, node failures are far more frequent.
Conclusion: node failure is more important to system availability.
Correlated Failures
A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes. Losing nodes before replication can start can cause unavailability of data.
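A failure burst, in the sense used above, can be sketched by grouping node-failure timestamps that occur close together. The 120-second window below is an assumption for illustration, not necessarily the paper's exact parameter.

```python
def group_bursts(failure_times, window=120):
    """Group failure timestamps (seconds) into bursts: a failure joins the
    current burst if it occurs within `window` seconds of the previous
    failure, otherwise it starts a new burst."""
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts

# Five node failures: two bursts plus one isolated failure.
print(group_bursts([0, 50, 300, 310, 1000]))  # → [[0, 50], [300, 310], [1000]]
```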
Correlated Failures
Examples: rolling reboots of a cluster; "oh s*!t, datacenter on fire!" (usually not that bad).
Coping with Failure
Replication: 3 replicas is standard in large clusters (MTTF ≈ 27,000 years).
Encoding: Reed-Solomon striping (MTTF ≈ 27.3M years).
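Beyond MTTF, a quick way to see why encoding is attractive is raw storage overhead. The RS(9,4) geometry below (9 data blocks plus 4 code blocks) is an assumed example configuration, not necessarily the one the paper evaluates.

```python
def replication_overhead(r):
    """Raw bytes stored per logical byte with r full replicas."""
    return float(r)

def rs_overhead(data_blocks, code_blocks):
    """Raw bytes stored per logical byte with Reed-Solomon striping
    using `data_blocks` data blocks and `code_blocks` code blocks."""
    return (data_blocks + code_blocks) / data_blocks

# 3-way replication vs. an assumed RS(9,4) layout:
print(replication_overhead(3))       # → 3.0   (tolerates 2 lost copies)
print(round(rs_overhead(9, 4), 2))   # → 1.44  (tolerates 4 lost blocks)
```

The striped layout tolerates more simultaneous losses while storing less than half the raw bytes, which is exactly the tension raised in the conclusions below about why replication nonetheless remains the default.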
Coping with Failure: Cell Replication (Datacenter Replication)
Block A is stored in both Cell 1 and Cell 2, so losing an entire cell still leaves a copy in the other.
Modeling Failures
We've seen the data; now let's model the behavior.
Modeling Failures
A chunk of data can be in one of many states. Consider replication R = 3: lose a replica and 2 are still available; at 0 replicas the service is unavailable. Each loss of a replica has a known probability, and the recovery rate is also known.
Markov Model
ρ = recovery rate
λ = failure rate
s = number of block replicas
r = minimum replication
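Using the symbols above, the chain's expected time to data loss can be computed by solving the standard first-step equations. This is a generic birth-death sketch, not the paper's exact chain: it assumes data is lost only at 0 replicas (i.e., r = 1) and that one replica is re-created at rate ρ.

```python
def mean_time_to_loss(lam, rho, s):
    """Expected time until 0 replicas remain, starting from s replicas.
    Transitions: state k loses a replica at rate k*lam; states k < s
    recover one replica at rate rho.  Solves the first-step equations
      (k*lam + rho_k) * T_k = 1 + k*lam * T_{k-1} + rho_k * T_{k+1},
    with T_0 = 0, by Gaussian elimination."""
    n = s
    A = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for k in range(1, s + 1):
        i = k - 1
        rec = rho if k < s else 0.0    # no recovery transition at full replication
        A[i][i] = k * lam + rec
        if k >= 2:
            A[i][k - 2] -= k * lam     # coefficient of T_{k-1}
        if k < s:
            A[i][k] -= rec             # coefficient of T_{k+1}
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for c in range(col, n):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    T = [0.0] * n
    for row in range(n - 1, -1, -1):
        T[row] = (b[row] - sum(A[row][c] * T[c] for c in range(row + 1, n))) / A[row][row]
    return T[s - 1]

# Sanity check: with no recovery (rho = 0) the answer is the
# harmonic sum 1/(3λ) + 1/(2λ) + 1/λ.
print(round(mean_time_to_loss(1.0, 0.0, 3), 4))  # → 1.8333
```

Plugging in measured λ and ρ for a given cluster yields block-level MTTF figures like those on the next slide.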
Modeling Failures
Using Markov models, we can find the expected time to data loss (e.g., Nebraska: 402 years).
Modeling Failures For Multi-Cell Implementations
Paper Conclusions
Given Google's enormous amount of data, we can say:
Failures are typically short.
Node failures can happen in bursts, and are not independent.
In modern distributed file systems, disk failure is effectively the same as node failure.
The authors built a Markov model of failures that can accurately reason about past and future availability.
My Conclusions
This paper contributed greatly by showing data from very large-scale distributed file systems. If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook? Complicated code? Complicated administration?