Presentation is loading. Please wait.

Presentation is loading. Please wait.

MinCopysets: Derandomizing Replication in Cloud Storage

Similar presentations


Presentation on theme: "MinCopysets: Derandomizing Replication in Cloud Storage"— Presentation transcript:

1 MinCopysets: Derandomizing Replication in Cloud Storage
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University Unpublished – Please do not distribute

2 Overview Assumptions: no geo-replication, Azure uses much smaller clusters in practice Unpublished – Please do not distribute

3 RAMCloud Primary data stored on master (memory)
Divide each master’s data into chunks Chunks are replicated on backups (disk) When master crashes, recover from thousands of backups Masters Crashed Master Backups Unpublished – Please do not distribute

4 Random Replication Chunk 1 Chunk 2 Chunk 3 Node 1 Node 2 Node 3 Node 4
Chunk 1 Secondary Chunk 2 Primary Chunk 1 Secondary Chunk 3 Primary Chunk 3 Secondary Node 6 Node 7 Node 8 Node 9 Node 10 Chunk 2 Secondary Chunk 1 Primary Chunk 3 Secondary Chunk 2 Secondary Unpublished – Please do not distribute

5 The Problem Randomized replication loses data in power outages
0.5-1% of the nodes fail to reboot 1-2 times a year Result: handful of chunks (GBs of data) are unavailable (LinkedIn ‘12) Sub-problem: managed power downs Software upgrades Reduced power consumption Unpublished – Please do not distribute

6 Intuition If we have one chunk, we are safe:
Replicate chunk on three nodes Data is lost if failed nodes contain three copies of a chunk 1% of the nodes fail: % of data loss If we have millions of chunks, we lose data: 1000 node HDFS cluster has 10 million chunks 1% of the nodes fail: 99.93% of data loss Unpublished – Please do not distribute

7 Mathematical Intuition
A copyset of nodes is a single unit of failure Each chunk is replicated on a single copyset For one chunk, the probability of data loss is: 𝐹 𝑅 𝑁 𝑅 F = number of failed nodes R = replication factor N = number of nodes For all chunks, the probability is: 1− 1− 𝐹 𝑅 𝑁 𝑅 𝑁∙𝐵 B = number of chunks Unpublished – Please do not distribute

8 Changing R Doesn’t Help
Unpublished – Please do not distribute

9 Changing the Chunk Size Doesn’t Help
Unpublished – Please do not distribute

10 MinCopysets: Decouple Load Balancing and Durability
Split nodes into fixed replication groups Random Distribution: Place primary replica on random node Deterministic Replication: Place secondary replicas deterministically on same replication group as primary Unpublished – Please do not distribute

11 MinCopysets Architecture
Chunk 1 Chunk 2 Chunk 3 Chunk 4 Replication Group 1 Node 2 Replication Group 2 Chunk 2 Secondary Node 55 Replication Group 3 Chunk 1 Secondary Node 1 Chunk 4 Primary Chunk 3 Primary Node 83 Chunk 2 Secondary Node 8 Chunk 2 Primary Node 7 Chunk 1 Primary Node 24 Chunk 1 Secondary Node 22 Chunk 4 Secondary Node 47 Chunk 4 Secondary Chunk 3 Secondary Chunk 3 Secondary Unpublished – Please do not distribute

12 Unpublished – Please do not distribute

13 Unpublished – Please do not distribute

14 Unpublished – Please do not distribute

15 Extreme Failure Scenarios
In the extreme scenario of 3-4% of the cluster’s nodes fail to reboot, MinCopysets provides low data loss probabilities For example: 4000 node HDFS cluster 120 nodes fail to reboot after power outage Only 3.5% probability of data loss Unpublished – Please do not distribute

16 Extreme Failure Scenarios: Normal Clusters
Unpublished – Please do not distribute

17 Extreme Failure Scenarios: Big Clusters
Unpublished – Please do not distribute

18 MinCopysets’ Trade Off
Trades off frequency and magnitude of failures Expected data loss is the same Data loss occurs very rarely The magnitude of data loss is greater Unpublished – Please do not distribute

19 Frequency vs. Magnitude of Failures
Setup: 5000 node HDFS cluster 3 TB per machine R = 3 Power outage once a year Random replication Lose 5.5 GB every single year MinCopysets Lose data once every 625 years Lose an entire node in case of failure Unpublished – Please do not distribute

20 RAMCloud Implementation
RAMCloud implementation was relatively straightforward Two non-trivial issues: Need to manage groups of nodes Allocate chunks on entire groups Manage nodes joining and leaving groups Machine failures are more complex Need to re-replicate entire group, rather than individual nodes Unpublished – Please do not distribute

21 RAMCloud Implementation
Coordinator Server ID ReplicationGroup ID Server 0 5 Server 1 Server 2 Server 3 7 Request: Assign Replication Group RPC Coordinator Server List Request: Open New Chunk RPC RAMCloud Backup RAMCloud Master Reply: Replication Group Unpublished – Please do not distribute

22 HDFS Implementation Even simpler than RAMCloud
In HDFS replication decisions are centralized on NameNode, in RAMCloud they are distributed NameNode assigns DataNodes to replication groups Prototyped in 200 LoC Unpublished – Please do not distribute

23 HDFS Issues Has the same issues as RAMCloud in managing groups of nodes Issue: Repair bandwidth Solution: Hybrid scheme Issue: Network bottlenecks and load balancing Solution: Kill replication group, re-replicate its data elsewhere Issue: Replication group’s capacity is limited by node with the smallest capacity Solution: Choose replication groups with similar capacities Unpublished – Please do not distribute

24 Facebook’s HDFS Replication
Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss Facebook’s Algorithm: Primary replica is replicated on node j and rack k Secondary replicas are replicated on randomly selected nodes among (j+1,… ,j+5), on racks (k+1, k+2) Unpublished – Please do not distribute

25 Facebook’s Replication
First describe what Facebook did Unpublished – Please do not distribute

26 Hybrid MinCopysets Split nodes into replication groups of 2 and 15
First and second replica are always placed on the group of 2 Third replica is randomly placed on the group of 15

27

28 Thank You! Stanford University Unpublished – Please do not distribute


Download ppt "MinCopysets: Derandomizing Replication in Cloud Storage"

Similar presentations


Ads by Google