MinCopysets: Derandomizing Replication in Cloud Storage
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout, and Mendel Rosenblum
Stanford University
Unpublished – Please do not distribute
Overview
Assumptions: no geo-replication; in practice, Azure uses much smaller clusters
RAMCloud
Primary data is stored on masters (memory)
Each master's data is divided into chunks
Chunks are replicated on backups (disk)
When a master crashes, recover from thousands of backups
[Diagram: masters (memory), a crashed master, and backups (disk)]
Random Replication
[Diagram: chunks 1–3, each with one primary and two secondary replicas, scattered at random across nodes 1–10]
The Problem
Randomized replication loses data in power outages
1–2 times a year, 0.5–1% of the nodes fail to reboot
Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12)
Sub-problem: managed power-downs (software upgrades, reduced power consumption)
Intuition
If we have one chunk, we are safe:
Replicate the chunk on three nodes
Data is lost only if the failed nodes hold all three copies of the chunk
1% of the nodes fail: roughly 0.00007% probability of data loss (1000-node cluster)
If we have millions of chunks, we lose data:
A 1000-node HDFS cluster has 10 million chunks
1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition
A copyset is a set of R nodes that together store all replicas of some chunk; a copyset is a single unit of failure
Each chunk is replicated on a single copyset
For one chunk, the probability of data loss is: $\binom{F}{R} / \binom{N}{R}$
F = number of failed nodes, R = replication factor, N = number of nodes
For all chunks, the probability is: $1 - \left(1 - \binom{F}{R} / \binom{N}{R}\right)^{N \cdot B}$
B = number of chunks per node
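To make the formulas concrete, here is a minimal Python sketch (mine, not part of the talk); the function names are illustrative:

from math import comb

def loss_prob_one_chunk(n, f, r=3):
    # Probability that the F failed nodes cover all R replicas of one chunk.
    return comb(f, r) / comb(n, r)

def loss_prob_random(n, f, b, r=3):
    # Probability that at least one of the N*B chunks loses all of its
    # replicas under random replication, per the formula above.
    p = loss_prob_one_chunk(n, f, r)
    return 1 - (1 - p) ** (n * b)

# The intuition slide's numbers: 1000 nodes, 10,000 chunks per node,
# 1% of nodes (10 nodes) fail.
print(loss_prob_one_chunk(1000, 10))       # ~7.2e-07 (~0.00007%)
print(loss_prob_random(1000, 10, 10_000))  # ~0.9993  (99.93%)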
Changing R Doesn’t Help
Changing the Chunk Size Doesn’t Help
MinCopysets: Decouple Load Balancing and Durability
Split nodes into fixed replication groups
Random distribution: place the primary replica on a random node
Deterministic replication: place the secondary replicas deterministically on the same replication group as the primary
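As an illustration (a sketch under my assumptions, not the authors' code), MinCopysets placement might look like this in Python:

import random

R = 3  # replication factor

def form_groups(nodes, r=R):
    # Split nodes into fixed, disjoint replication groups of size R.
    random.shuffle(nodes)
    return [nodes[i:i + r] for i in range(0, len(nodes) - len(nodes) % r, r)]

def place_chunk(groups):
    # Random distribution: pick a random group (equivalently, a random
    # primary node, which determines its group).
    group = random.choice(groups)
    primary, *secondaries = group
    # Deterministic replication: the secondaries are simply the rest of
    # the primary's group, so the only copysets are the groups themselves.
    return primary, secondaries

groups = form_groups(list(range(12)))
print(place_chunk(groups))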
MinCopysets Architecture
[Diagram: chunks 1–4 assigned to replication groups 1–3 (nodes include 1, 2, 7, 8, 22, 24, 47, 55, 83); each chunk's primary and both secondary replicas reside within a single replication group]
Extreme Failure Scenarios
In the extreme scenario where 3–4% of the cluster's nodes fail to reboot, MinCopysets still provides low data loss probabilities
For example: a 4000-node HDFS cluster
120 nodes fail to reboot after a power outage
Only 3.5% probability of data loss
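As a sanity check (my computation, not on the slide), applying the earlier loss formula with one copyset per replication group, i.e. an exponent of $N/R$ instead of $N \cdot B$, reproduces this figure:
$1 - \left(1 - \binom{120}{3} / \binom{4000}{3}\right)^{4000/3} \approx 1 - (1 - 2.6 \times 10^{-5})^{1333} \approx 3.5\%$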
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets' Trade-off
Trades off the frequency and magnitude of failures
The expected amount of data lost is the same
Data loss occurs very rarely
But when it does occur, the magnitude of the loss is greater
Frequency vs. Magnitude of Failures
Setup: 5000-node HDFS cluster, 3 TB per machine, R = 3, one power outage per year
Random replication: loses 5.5 GB every single year
MinCopysets: loses data once every 625 years, but loses an entire node's data (3 TB) when it does
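Back-of-the-envelope (mine, not on the slide): MinCopysets' expected loss is $3\,\text{TB} / 625\,\text{years} \approx 4.9\,\text{GB/year}$, the same order as random replication's 5.5 GB/year, so the expectation is indeed comparable while the failure profile changes.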
RAMCloud Implementation
The RAMCloud implementation was relatively straightforward
Two non-trivial issues:
Need to manage groups of nodes: allocate chunks on entire groups, and manage nodes joining and leaving groups
Machine failures are more complex: the entire group must be re-replicated, rather than an individual node
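A rough sketch (assumed structure, not RAMCloud's actual code) of coordinator-side group management: new backups wait in an unassigned pool until a full group can be formed, and a single failure retires the whole group:

from dataclasses import dataclass, field

R = 3

@dataclass
class GroupManager:
    unassigned: list = field(default_factory=list)  # servers awaiting a group
    groups: dict = field(default_factory=dict)      # group id -> member servers
    next_id: int = 0

    def server_joined(self, server):
        self.unassigned.append(server)
        if len(self.unassigned) >= R:
            # Form a new replication group from R unassigned servers.
            members = [self.unassigned.pop() for _ in range(R)]
            self.groups[self.next_id] = members
            self.next_id += 1

    def server_failed(self, server):
        for gid, members in list(self.groups.items()):
            if server in members:
                # One failure retires the whole group: survivors return to
                # the pool, and the group's chunks must be re-replicated
                # on a different group.
                del self.groups[gid]
                self.unassigned.extend(m for m in members if m != server)
                return gid  # caller re-replicates this group's data
        return None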
RAMCloud Implementation
[Diagram: a RAMCloud master sends an Open New Chunk RPC to a backup and receives the backup's replication group in reply; backups obtain group assignments from the coordinator via an Assign Replication Group RPC; the coordinator's server list maps server IDs to replication group IDs (e.g., servers 0–2 → group 5, server 3 → group 7)]
HDFS Implementation
Even simpler than RAMCloud: in HDFS, replication decisions are centralized at the NameNode, while in RAMCloud they are distributed
The NameNode assigns DataNodes to replication groups
Prototyped in 200 lines of code
HDFS Issues
Has the same issues as RAMCloud in managing groups of nodes
Issue: repair bandwidth. Solution: a hybrid scheme (described below)
Issue: network bottlenecks and load balancing. Solution: kill the replication group and re-replicate its data elsewhere
Issue: a replication group's capacity is limited by its smallest-capacity node. Solution: form replication groups from nodes with similar capacities
Facebook’s HDFS Replication
Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
Facebook's algorithm:
The primary replica is placed on node j in rack k
The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
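A small sketch (my reading of the slide; the wrap-around modulo arithmetic and the node/rack coordinates are simplifying assumptions):

import random

def facebook_place(j, k, num_nodes, num_racks, r=3):
    # Primary replica on node j in rack k.
    primary = (j, k)
    # Candidate window from the slide: nodes (j+1, ..., j+5) and racks
    # (k+1, k+2); indices wrap around.
    nodes = [(j + d) % num_nodes for d in range(1, 6)]
    racks = [(k + d) % num_racks for d in range(1, 3)]
    # Pick r-1 distinct secondary nodes, each on one of the two racks.
    secondaries = [(n, random.choice(racks))
                   for n in random.sample(nodes, r - 1)]
    return primary, secondaries

print(facebook_place(j=7, k=2, num_nodes=100, num_racks=10))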
Facebook's Replication
Hybrid MinCopysets
Split nodes into replication groups of 2 and 15
The first and second replicas are always placed on the group of 2
The third replica is placed randomly within the group of 15
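One plausible reading (mine; the slide does not spell out how the two group types relate) is that each fixed pair draws its third replica from an associated scatter group of 15 nodes, restoring repair-bandwidth diversity while keeping copysets bounded. make_layout and hybrid_place are illustrative names:

import random

def make_layout(nodes):
    # Illustrative layout: disjoint pairs, plus a scatter group of 15
    # other nodes per pair for the third replica.
    random.shuffle(nodes)
    pairs = [tuple(nodes[i:i + 2]) for i in range(0, len(nodes) - 1, 2)]
    scatter = {p: random.sample([n for n in nodes if n not in p], 15)
               for p in pairs}
    return pairs, scatter

def hybrid_place(pairs, scatter):
    pair = random.choice(pairs)            # first two replicas: fixed pair
    third = random.choice(scatter[pair])   # third replica: its group of 15
    return (*pair, third)

pairs, scatter = make_layout(list(range(100)))
print(hybrid_place(pairs, scatter))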
Thank You!
Stanford University