Published by Sade Rasbury. Modified over 10 years ago.
1
Availability in Globally Distributed Storage Systems
Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan
Presented by Ala`a Ibrahim
2
Outline
Introduction
Disk Failures
Correlated Failures
Fault Tolerance Mechanisms
Markov Model of Stripe Availability
Markov Model Findings
Conclusions
3
Data Center
4
Data Center Components
Server components, interconnects, racks, cluster of racks
5
Data Center Components
All of these components can fail: server components, interconnects, racks, cluster of racks
6
Cell, Stripe and Chunk
[Diagram labels: Cell 1, Cell 2; GFS Instance 1, GFS Instance 2; Stripe 1, Stripe 2; Chunks]
7
Failure Sources and Availability
Failure sources:
Hardware: disks, memory, etc.
Software: chunk server process
Network interconnect
Power distribution unit
Availability: reasons a node becomes unavailable:
Overloaded
Crash or restart
Hardware error
Automated repair processes
9
Disk Failures
Node restarts
Planned machine reboots
Unplanned machine reboots
Unknown
10
Fault Tolerance Mechanisms
Replication (R = n): ‘n’ identical chunks (‘n’ is the replication factor) are placed across storage nodes in different racks, cells, or data centers
Erasure Coding (RS(n, m)): each stripe holds ‘n’ distinct data blocks and ‘m’ code blocks
Can recover at most ‘m’ lost blocks from the remaining blocks; any ‘n’ of the ‘n + m’ blocks are enough to rebuild the stripe
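The trade-off between the two mechanisms can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the example parameters (R = 3, RS(6, 3)) are assumptions chosen for the example, not values taken from the slides.

```python
def replication(r):
    """Replication with r identical copies of every chunk."""
    return {
        "scheme": f"R={r}",
        "blocks_stored": r,           # blocks stored per data block
        "overhead": float(r),         # bytes stored per byte of user data
        "losses_tolerated": r - 1,    # copies that may be lost with data still readable
    }


def reed_solomon(n, m):
    """RS(n, m): n data blocks plus m code blocks per stripe."""
    return {
        "scheme": f"RS({n},{m})",
        "blocks_stored": n + m,       # blocks stored per stripe
        "overhead": (n + m) / n,      # bytes stored per byte of user data
        "losses_tolerated": m,        # any n of the n + m blocks rebuild the stripe
    }


if __name__ == "__main__":
    # Hypothetical example parameters, chosen only for illustration.
    for row in (replication(3), reed_solomon(6, 3)):
        print(row)
```

Under these example parameters the erasure-coded stripe tolerates one more lost block than three-way replication while storing half as many bytes per byte of user data.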
11
Replication
1 chunk, 5 replicas
Fast encoding / decoding
Very space inefficient
12
Erasure Coding
Encode: ‘n’ data blocks plus ‘m’ code blocks give ‘n + m’ blocks
14
Erasure Coding
Encode: ‘n’ data blocks plus ‘m’ code blocks give ‘n + m’ blocks
Decode: ‘n + m’ blocks back to the ‘n’ data blocks
Highly space efficient
Slow encoding / decoding
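The encode and decode steps can be illustrated with the simplest possible erasure code. The sketch below is not Reed-Solomon: it swaps in a single XOR parity block (the ‘m’ = 1 case), which is enough to show data blocks going in, ‘n + 1’ blocks coming out, and a lost block being rebuilt. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the n data blocks followed by one XOR parity (code) block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(stripe):
    """Recover the n data blocks from a stripe in which at most one block is None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one lost block")
    stripe = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks equals the missing block.
        stripe[missing[0]] = xor_blocks(survivors)
    return stripe[:-1]  # drop the parity block


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stripe = encode(data)
    stripe[1] = None                 # simulate losing one block
    assert decode(stripe) == data
```

A real RS(n, m) code replaces the XOR with arithmetic over a finite field so that ‘m’ lost blocks can be repaired, which is where the slower encoding and decoding comes from.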
15
Correlated Failures
Failure Domain: a set of machines that fail simultaneously from a common source of failure
Failure Burst: a sequence of node failures, each occurring within a time window ‘w’ of the next
Window: w = 120 s
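The failure burst definition translates directly into a grouping pass over sorted failure timestamps. This is a minimal sketch: the function name is hypothetical, timestamps are assumed to be in seconds, and "within a window" is assumed to mean a gap of at most ‘w’.

```python
def group_into_bursts(failure_times, window=120.0):
    """Group node-failure timestamps (seconds) into bursts.

    A failure joins the current burst when it occurs no more than `window`
    seconds after the previous failure; otherwise it starts a new burst.
    Isolated failures come back as groups of size 1, and the caller can
    filter them out if only multi-node bursts are of interest.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    for burst in group_into_bursts([0, 50, 130, 400, 1000, 1100]):
        print(len(burst), burst)
```

Because each failure is compared only with the one before it, a burst can span much longer than ‘w’ in total as long as no gap inside it exceeds ‘w’.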
16
Correlated Failures… Failure Burst (Window Size)
17
Markov Model
Chunk placement policy
Cell simulation
Trace-based simulation
Priority queue
18
Markov Chain
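The slide shows only the chain diagram, so the following is a rough sketch under stated assumptions rather than the authors' model. It assumes the state is the number of available chunks in a stripe, each available chunk fails independently at rate `fail_rate`, one chunk at a time is recovered at rate `recovery_rate`, and the stripe becomes unavailable once fewer than `min_chunks` chunks remain. Correlated failures are ignored, and all names and rates are hypothetical.

```python
def mean_time_to_unavailability(total_chunks, min_chunks, fail_rate, recovery_rate):
    """Expected time until a stripe is unavailable, starting with all chunks available.

    Let T_k be the expected time to unavailability from state k (k chunks
    available) and D_k = T_k - T_(k-1). For this birth-death chain:
        D_k = (1 + repair_k * D_(k+1)) / (k * fail_rate)
    where repair_k is recovery_rate for k < total_chunks and 0 in the top state.
    The answer is the sum of D_k for k = min_chunks .. total_chunks.
    """
    if not 1 <= min_chunks <= total_chunks:
        raise ValueError("need 1 <= min_chunks <= total_chunks")
    if fail_rate <= 0 or recovery_rate < 0:
        raise ValueError("fail_rate must be positive and recovery_rate non-negative")
    step = 0.0    # holds D_(k+1) while computing D_k
    total = 0.0
    for k in range(total_chunks, min_chunks - 1, -1):
        repair = recovery_rate if k < total_chunks else 0.0
        step = (1.0 + repair * step) / (k * fail_rate)
        total += step
    return total


if __name__ == "__main__":
    # Hypothetical rates (events per unit time), for illustration only.
    print(mean_time_to_unavailability(3, 1, fail_rate=1.0, recovery_rate=10.0))  # R = 3
    print(mean_time_to_unavailability(9, 6, fail_rate=1.0, recovery_rate=10.0))  # RS(6, 3)
```

Replication with R = 3 corresponds to `total_chunks=3, min_chunks=1`, and RS(6, 3) to `total_chunks=9, min_chunks=6`; raising `recovery_rate` lengthens the expected time to unavailability in both cases.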
19
Conclusion
The findings provide feedback for improving:
Replication and encoding schemes
Recovery rate