Failures in the System Two major components in a node: the application and the system.
Failures in the System Nebraska runs similar systems to Google. [Diagram: the Google stack (Bigtable, GFS, file system, hard drive) alongside the Nebraska stack (cluster scheduler, Hadoop, file system, hard drive), each split into an application and a system layer.] A failure here will cause unavailability, and could cause data loss.
Unavailability: Defined Data on a node is unreachable. Detection: periodic heartbeats are missing. Correction: lasts until the node comes back or the system recreates the data.
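A minimal sketch of that detect-and-correct loop, assuming illustrative values for the heartbeat interval, the missed-heartbeat count, and the delay before re-replication; none of these names or thresholds come from the paper.

```python
import time

# Sketch of heartbeat-based unavailability detection (illustrative only;
# all thresholds below are assumptions, not the paper's implementation).
HEARTBEAT_INTERVAL = 10          # seconds between expected heartbeats (assumed)
MISSED_BEFORE_UNAVAILABLE = 3    # missed heartbeats before a node is flagged (assumed)
REREPLICATION_DELAY = 15 * 60    # wait before recreating data elsewhere (assumed)


class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}           # node -> timestamp of last heartbeat
        self.unavailable_since = {}   # node -> timestamp it was declared unavailable

    def heartbeat(self, node, now=None):
        """A node checked in: record it and clear any unavailability flag."""
        now = now if now is not None else time.time()
        self.last_seen[node] = now
        self.unavailable_since.pop(node, None)

    def check(self, now=None):
        """Detect missing heartbeats and decide when to start re-replication."""
        now = now if now is not None else time.time()
        for node, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_INTERVAL * MISSED_BEFORE_UNAVAILABLE:
                # Node is unavailable; remember when we first noticed.
                self.unavailable_since.setdefault(node, now)
                if now - self.unavailable_since[node] > REREPLICATION_DELAY:
                    # The outage has lasted too long: recreate this node's data
                    # from surviving replicas (recovery itself not shown here).
                    print(f"re-replicating chunks held by {node}")
```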
Unavailability: Measured [Graph of measured unavailability events, annotated with the point where replication starts.] Question: after replication starts, why does it take so long to recover?
Node Availability [Chart: storage vs. software restart times.] Software is fast to restart.
Node Availability: Time [Chart of downtime by cause, highlighting planned reboots.] Node updates (planned reboots) cause the most downtime.
MTTF for Components Even though a disk failure can cause data loss, node failure happens much more often. Conclusion: node failure matters more to system availability.
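A back-of-the-envelope comparison to make that point concrete; every number below (MTTF values, disks per node) is an assumption for illustration, not a figure from the paper.

```python
# Rough comparison of failure event rates per node per year.
# All numbers are illustrative assumptions.
HOURS_PER_YEAR = 24 * 365

disk_mttf_hours = 50 * HOURS_PER_YEAR   # assumed: a disk fails every ~50 years
node_mttf_hours = 1 * HOURS_PER_YEAR    # assumed: a node fails/reboots about yearly
disks_per_node = 4                      # assumed

# Expected failure events per node-year contributed by each component.
disk_events = disks_per_node * HOURS_PER_YEAR / disk_mttf_hours
node_events = HOURS_PER_YEAR / node_mttf_hours

print(f"disk failures per node-year: {disk_events:.2f}")   # ~0.08
print(f"node failures per node-year: {node_events:.2f}")   # ~1.00
# Node-level events dominate, so node failure drives availability even though
# only disk failures destroy data outright.
```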
Correlated Failures A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes. Losing more nodes before replication can start can make data unavailable.
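A small Monte Carlo sketch of why bursts matter: with random 3-way placement, the number of blocks that lose every replica grows sharply with the size of a simultaneous failure burst, so losing many nodes at once (before re-replication can restore redundancy) is far worse than losing the same nodes one at a time. Cluster size, block count, and burst sizes are illustrative assumptions.

```python
import random

# How many blocks lose every replica when a burst of node failures hits
# before re-replication can kick in. All parameters are assumed.
NODES    = 200      # nodes in the cell
BLOCKS   = 20_000   # blocks, each with 3 replicas on distinct random nodes
REPLICAS = 3
TRIALS   = 20

def blocks_lost(failed_nodes):
    """Count blocks whose replicas all land on currently-failed nodes."""
    lost = 0
    for _ in range(BLOCKS):
        placement = random.sample(range(NODES), REPLICAS)
        if all(n in failed_nodes for n in placement):
            lost += 1
    return lost

for burst in (1, 5, 20, 50):   # number of nodes failing simultaneously
    avg = sum(blocks_lost(set(random.sample(range(NODES), burst)))
              for _ in range(TRIALS)) / TRIALS
    print(f"burst of {burst:2d} nodes -> ~{avg:.1f} blocks unavailable")
# Losses grow much faster than linearly with burst size: single failures cost
# essentially nothing, while large bursts take out many blocks at once.
```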
Correlated Failures [Chart of failure bursts over time; one burst is annotated as rolling reboots of the cluster.] Oh s*!t, datacenter on fire! (Maybe not that bad.)
Coping with Failure Replication vs. encoding. 3 replicas is standard in large clusters. MTTF figures cited on the slides: 27,000 years and 27.3 M years for the two schemes.
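For intuition, a quick comparison of 3-way replication against a Reed-Solomon style (9,4) code, looking at storage overhead and the chance of losing a block when each chunk is independently unavailable with probability p. The code parameters and p are assumptions for illustration; the MTTF figures on the slide come from the paper's model, not from this calculation.

```python
from math import comb

p = 0.01  # assumed probability that any given chunk/replica is unavailable

# 3-way replication: block lost only if all 3 replicas are gone; 3x storage.
rep_overhead = 3.0
rep_loss = p ** 3

# Reed-Solomon style (9,4): 9 data chunks + 4 parity chunks; any 9 of the 13
# reconstruct the block, so it is lost only when 5 or more chunks are gone.
n_data, n_parity = 9, 4
n_total = n_data + n_parity
rs_overhead = n_total / n_data
rs_loss = sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
              for k in range(n_parity + 1, n_total + 1))

print(f"replication: {rep_overhead:.2f}x storage, loss prob {rep_loss:.2e}")
print(f"RS(9,4):     {rs_overhead:.2f}x storage, loss prob {rs_loss:.2e}")
# Under independent chunk failures the code gives both less storage overhead
# (~1.44x vs 3x) and a lower loss probability -- which is what makes the
# presenter's closing question ("why isn't it used everywhere?") interesting.
```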
Coping with Failure Cell replication (datacenter replication). [Diagram: Block A is replicated across cells, with a copy in Cell 1 and a copy in Cell 2.]
Modeling Failures We've seen the data; now let's model the behavior. A chunk of data can be in one of several states. With replication = 3, the chunk has 3, 2, 1, or 0 available replicas: losing a replica moves it down a state (with 2 replicas it is still available), 0 replicas means the data is unavailable, and recovery moves it back up. Each loss of a replica has a known probability, and the recovery rate is also known.
Markov Model ρ = recovery rate, λ = failure rate, s = number of block replicas, r = minimum replication.
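A sketch of how such a birth-death Markov model can be solved numerically for the mean time to data loss, using the slide's symbols (ρ as rho, λ as lam, s replicas, minimum replication r). The example rates are assumptions, and treating the failure rate as k·λ for a state with k live replicas is a modeling choice for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def mttf(s, r, lam, rho):
    """Mean time until the chain drops below r live replicas, starting from s.

    States k = r..s are transient; a failure from state r is absorption (loss).
    From state k, failures occur at rate k*lam and recovery (k -> k+1) at rho.
    Solves the standard first-passage equations q_k*t_k - sum(rate*t_j) = 1.
    """
    states = list(range(r, s + 1))
    n = len(states)
    A = np.zeros((n, n))
    b = np.ones(n)
    for idx, k in enumerate(states):
        out_rate = k * lam + (rho if k < s else 0.0)
        A[idx, idx] = out_rate
        if k < s:                    # recovery: k -> k+1
            A[idx, idx + 1] -= rho
        if k > r:                    # failure to a still-safe state: k -> k-1
            A[idx, idx - 1] -= k * lam
        # a failure from k == r is absorption (data loss), so no term is added
    t = np.linalg.solve(A, b)        # expected hitting times for each state
    return t[-1]                     # starting with all s replicas

# Example with assumed rates: 3 replicas, loss below 1 replica,
# per-replica failure about once a year, recovery in about a day.
lam = 1 / (365 * 24)   # failures per hour per replica (assumed)
rho = 1 / 24           # recoveries per hour (assumed)
print(f"MTTF ≈ {mttf(s=3, r=1, lam=lam, rho=rho) / (24 * 365):,.0f} years")
```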
Modeling Failures Using Markov models, we can find the expected time to failure. Nebraska: 402 years.
Modeling Failures For Multi-Cell Implementations
Paper Conclusions Given the enormous amount of data from Google, the paper can say: failures are typically short; node failures happen in bursts and are not independent; in modern distributed file systems, a disk failure is effectively a node failure. The authors built a Markov model of failures that accurately reasons about past and future availability.
My Conclusions This paper contributed greatly by showing data from very large scale distributed file systems. If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook? Complicated code? Complicated administration?