MinCopysets: Derandomizing Replication in Cloud Storage
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout, and Mendel Rosenblum
Stanford University
Unpublished – Please do not distribute
Overview
Assumptions: no geo-replication; in practice, Azure uses much smaller clusters
RAMCloud
Primary data is stored on masters (memory)
Each master's data is divided into chunks
Chunks are replicated on backups (disk)
When a master crashes, recover from thousands of backups
[Diagram: masters (memory), a crashed master, and backups (disk)]
Random Replication
[Diagram: chunks 1–3, each with one primary and two secondary replicas, scattered at random across nodes 1–10]
The Problem
Randomized replication loses data in power outages
1–2 times a year, 0.5–1% of the nodes fail to reboot
Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12)
Sub-problem: managed power-downs (software upgrades, reduced power consumption)
Intuition
If we have one chunk, we are safe:
Replicate the chunk on three nodes
Data is lost only if the failed nodes hold all three copies of the chunk
1% of the nodes fail: roughly 0.00007% probability of data loss (1000-node cluster)
If we have millions of chunks, we lose data:
A 1000-node HDFS cluster has 10 million chunks
1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition
A copyset is a set of R nodes that together store all replicas of some chunk; a copyset is a single unit of failure
Each chunk is replicated on a single copyset
For one chunk, the probability of data loss is: $\binom{F}{R} / \binom{N}{R}$
F = number of failed nodes, R = replication factor, N = number of nodes
For all chunks, the probability is: $1 - \left(1 - \binom{F}{R} / \binom{N}{R}\right)^{N \cdot B}$
B = number of chunks per node
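To make the formulas concrete, here is a minimal Python sketch (mine, not part of the talk); the function names are illustrative:

from math import comb

def loss_prob_one_chunk(n, f, r=3):
    # Probability that the F failed nodes cover all R replicas of one chunk.
    return comb(f, r) / comb(n, r)

def loss_prob_random(n, f, b, r=3):
    # Probability that at least one of the N*B chunks loses all of its
    # replicas under random replication, per the formula above.
    p = loss_prob_one_chunk(n, f, r)
    return 1 - (1 - p) ** (n * b)

# The intuition slide's numbers: 1000 nodes, 10,000 chunks per node,
# 1% of nodes (10 nodes) fail.
print(loss_prob_one_chunk(1000, 10))       # ~7.2e-07 (~0.00007%)
print(loss_prob_random(1000, 10, 10_000))  # ~0.9993  (99.93%)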
Changing R Doesn’t Help
Changing the Chunk Size Doesn’t Help
MinCopysets: Decouple Load Balancing and Durability
Split nodes into fixed replication groups
Random distribution: place the primary replica on a random node
Deterministic replication: place the secondary replicas deterministically on the same replication group as the primary
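As an illustration (a sketch under my assumptions, not the authors' code), MinCopysets placement might look like this in Python:

import random

R = 3  # replication factor

def form_groups(nodes, r=R):
    # Split nodes into fixed, disjoint replication groups of size R.
    random.shuffle(nodes)
    return [nodes[i:i + r] for i in range(0, len(nodes) - len(nodes) % r, r)]

def place_chunk(groups):
    # Random distribution: pick a random group (equivalently, a random
    # primary node, which determines its group).
    group = random.choice(groups)
    primary, *secondaries = group
    # Deterministic replication: the secondaries are simply the rest of
    # the primary's group, so the only copysets are the groups themselves.
    return primary, secondaries

groups = form_groups(list(range(12)))
print(place_chunk(groups))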
MinCopysets Architecture
[Diagram: chunks 1–4 assigned to replication groups 1–3 (nodes include 1, 2, 7, 8, 22, 24, 47, 55, 83); each chunk's primary and both secondary replicas reside within a single replication group]
Extreme Failure Scenarios
In the extreme scenario where 3–4% of the cluster's nodes fail to reboot, MinCopysets still provides low data loss probabilities
For example: a 4000-node HDFS cluster
120 nodes fail to reboot after a power outage
Only 3.5% probability of data loss
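As a sanity check (my computation, not on the slide), applying the earlier loss formula with one copyset per replication group, i.e. an exponent of $N/R$ instead of $N \cdot B$, reproduces this figure:
$1 - \left(1 - \binom{120}{3} / \binom{4000}{3}\right)^{4000/3} \approx 1 - (1 - 2.6 \times 10^{-5})^{1333} \approx 3.5\%$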
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets' Trade-off
Trades off the frequency and magnitude of failures
The expected amount of data lost is the same
Data loss occurs very rarely
But when it does occur, the magnitude of the loss is greater
Frequency vs. Magnitude of Failures
Setup: 5000-node HDFS cluster, 3 TB per machine, R = 3, one power outage per year
Random replication: loses 5.5 GB every single year
MinCopysets: loses data once every 625 years, but loses an entire node's data (3 TB) when it does
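Back-of-the-envelope (mine, not on the slide): MinCopysets' expected loss is $3\,\text{TB} / 625\,\text{years} \approx 4.9\,\text{GB/year}$, the same order as random replication's 5.5 GB/year, so the expectation is indeed comparable while the failure profile changes.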
RAMCloud Implementation
The RAMCloud implementation was relatively straightforward
Two non-trivial issues:
Need to manage groups of nodes: allocate chunks on entire groups, and manage nodes joining and leaving groups
Machine failures are more complex: the entire group must be re-replicated, rather than an individual node
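A rough sketch (assumed structure, not RAMCloud's actual code) of coordinator-side group management: new backups wait in an unassigned pool until a full group can be formed, and a single failure retires the whole group:

from dataclasses import dataclass, field

R = 3

@dataclass
class GroupManager:
    unassigned: list = field(default_factory=list)  # servers awaiting a group
    groups: dict = field(default_factory=dict)      # group id -> member servers
    next_id: int = 0

    def server_joined(self, server):
        self.unassigned.append(server)
        if len(self.unassigned) >= R:
            # Form a new replication group from R unassigned servers.
            members = [self.unassigned.pop() for _ in range(R)]
            self.groups[self.next_id] = members
            self.next_id += 1

    def server_failed(self, server):
        for gid, members in list(self.groups.items()):
            if server in members:
                # One failure retires the whole group: survivors return to
                # the pool, and the group's chunks must be re-replicated
                # on a different group.
                del self.groups[gid]
                self.unassigned.extend(m for m in members if m != server)
                return gid  # caller re-replicates this group's data
        return None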
RAMCloud Implementation
[Diagram: a RAMCloud master sends an Open New Chunk RPC to a backup and receives the backup's replication group in reply; backups obtain group assignments from the coordinator via an Assign Replication Group RPC; the coordinator's server list maps server IDs to replication group IDs (e.g., servers 0–2 → group 5, server 3 → group 7)]
HDFS Implementation
Even simpler than RAMCloud: in HDFS, replication decisions are centralized at the NameNode, while in RAMCloud they are distributed
The NameNode assigns DataNodes to replication groups
Prototyped in 200 lines of code
HDFS Issues
Has the same issues as RAMCloud in managing groups of nodes
Issue: repair bandwidth. Solution: a hybrid scheme (described below)
Issue: network bottlenecks and load balancing. Solution: kill the replication group and re-replicate its data elsewhere
Issue: a replication group's capacity is limited by its smallest-capacity node. Solution: form replication groups from nodes with similar capacities
Facebook’s HDFS Replication
Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
Facebook's algorithm:
The primary replica is placed on node j in rack k
The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
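A small sketch (my reading of the slide; the wrap-around modulo arithmetic and the node/rack coordinates are simplifying assumptions):

import random

def facebook_place(j, k, num_nodes, num_racks, r=3):
    # Primary replica on node j in rack k.
    primary = (j, k)
    # Candidate window from the slide: nodes (j+1, ..., j+5) and racks
    # (k+1, k+2); indices wrap around.
    nodes = [(j + d) % num_nodes for d in range(1, 6)]
    racks = [(k + d) % num_racks for d in range(1, 3)]
    # Pick r-1 distinct secondary nodes, each on one of the two racks.
    secondaries = [(n, random.choice(racks))
                   for n in random.sample(nodes, r - 1)]
    return primary, secondaries

print(facebook_place(j=7, k=2, num_nodes=100, num_racks=10))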
Facebook's Replication
Hybrid MinCopysets
Split nodes into replication groups of 2 and 15
The first and second replicas are always placed on the group of 2
The third replica is placed randomly within the group of 15
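One plausible reading (mine; the slide does not spell out how the two group types relate) is that each fixed pair draws its third replica from an associated scatter group of 15 nodes, restoring repair-bandwidth diversity while keeping copysets bounded. make_layout and hybrid_place are illustrative names:

import random

def make_layout(nodes):
    # Illustrative layout: disjoint pairs, plus a scatter group of 15
    # other nodes per pair for the third replica.
    random.shuffle(nodes)
    pairs = [tuple(nodes[i:i + 2]) for i in range(0, len(nodes) - 1, 2)]
    scatter = {p: random.sample([n for n in nodes if n not in p], 15)
               for p in pairs}
    return pairs, scatter

def hybrid_place(pairs, scatter):
    pair = random.choice(pairs)            # first two replicas: fixed pair
    third = random.choice(scatter[pair])   # third replica: its group of 15
    return (*pair, third)

pairs, scatter = make_layout(list(range(100)))
print(hybrid_place(pairs, scatter))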
Thank You!
Stanford University