MinCopysets: Derandomizing Replication in Cloud Storage

Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
Stanford University
Unpublished – Please do not distribute

Overview
Assumptions: no geo-replication; Azure uses much smaller clusters in practice.

RAMCloud
Primary data is stored in memory on masters.
Each master's data is divided into chunks.
Chunks are replicated on backups (disk).
When a master crashes, its data is recovered from thousands of backups.
(Diagram: masters, a crashed master, and backups.)

Random Replication
(Diagram: chunks 1-3 spread across nodes 1-10; each chunk's primary and secondary replicas are placed on randomly chosen nodes.)
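To make the baseline concrete, here is a minimal Python sketch of random replica placement; the node names and helper function are illustrative, not taken from the slides:

```python
import random

def random_placement(nodes, r=3):
    """Pick r distinct nodes uniformly at random for one chunk's replicas."""
    return random.sample(nodes, r)

nodes = [f"node-{i}" for i in range(10)]
for chunk_id in range(1, 4):
    print(f"chunk {chunk_id}: {random_placement(nodes)}")
```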

The Problem
Randomized replication loses data in power outages.
Power outages occur 1-2 times a year, and 0.5-1% of the nodes fail to reboot afterwards.
Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12).
Sub-problem: managed power-downs, e.g. for software upgrades or reduced power consumption.

Intuition
If we have one chunk, we are safe:
the chunk is replicated on three nodes, and data is lost only if the failed nodes hold all three copies.
If 1% of the nodes fail, the probability of data loss is about 0.0001%.
If we have millions of chunks, we lose data:
a 1000-node HDFS cluster has 10 million chunks.
If 1% of the nodes fail, the probability of data loss is 99.93%.

Mathematical Intuition
A copyset is a set of R nodes that forms a single unit of failure; each chunk is replicated on a single copyset.
For one chunk, the probability of data loss is:
  \binom{F}{R} \Big/ \binom{N}{R}
where F = number of failed nodes, R = replication factor, N = number of nodes.
For all chunks, the probability is:
  1 - \left( 1 - \binom{F}{R} \Big/ \binom{N}{R} \right)^{N \cdot B}
where B = number of chunks per node, so N·B is the total number of chunks.
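As a check on the formula, here is a minimal Python sketch (function names are illustrative) that reproduces the numbers quoted on the intuition slide:

```python
from math import comb

def p_loss_one_chunk(n, f, r=3):
    """Probability that all r randomly placed replicas of one chunk
    fall on failed nodes: C(f, r) / C(n, r)."""
    return comb(f, r) / comb(n, r)

def p_loss_any_chunk(n, f, total_chunks, r=3):
    """Probability that at least one of `total_chunks` independently placed
    chunks loses all r of its replicas."""
    return 1 - (1 - p_loss_one_chunk(n, f, r)) ** total_chunks

# Numbers from the intuition slide: 1000 nodes, 1% (10 nodes) fail.
print(p_loss_one_chunk(1000, 10))              # ~7.2e-7, i.e. ~0.0001%
print(p_loss_any_chunk(1000, 10, 10_000_000))  # ~0.9993, i.e. ~99.93%
```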

Changing R Doesn’t Help

Changing the Chunk Size Doesn’t Help

MinCopysets: Decouple Load Balancing and Durability
Split nodes into fixed replication groups.
Random distribution: place the primary replica on a random node.
Deterministic replication: place the secondary replicas deterministically on the same replication group as the primary.

MinCopysets Architecture
(Diagram: chunks 1-4 placed across three replication groups; each chunk's primary and all of its secondary replicas reside on the nodes of a single replication group.)
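A minimal Python sketch of the placement logic described on the previous two slides, assuming a group size equal to the replication factor R = 3 and using illustrative node names:

```python
import random

R = 3  # replication factor, also the replication group size

def make_replication_groups(nodes, r=R):
    """Split the node list into fixed, disjoint replication groups of size r."""
    nodes = list(nodes)
    random.shuffle(nodes)
    usable = len(nodes) - len(nodes) % r
    return [nodes[i:i + r] for i in range(0, usable, r)]

def place_chunk(groups):
    """MinCopysets placement: pick a random group, put the primary on a random
    node of that group, and the secondaries on the group's remaining nodes."""
    group = random.choice(groups)
    primary = random.choice(group)
    secondaries = [n for n in group if n != primary]
    return primary, secondaries

groups = make_replication_groups(f"node-{i}" for i in range(12))
print(place_chunk(groups))
```

With this scheme the set of possible copysets shrinks to the set of replication groups, which is what drives the data-loss probability down.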

Extreme Failure Scenarios
Even in the extreme scenario where 3-4% of the cluster's nodes fail to reboot, MinCopysets provides low data loss probabilities.
For example: in a 4000-node HDFS cluster where 120 nodes fail to reboot after a power outage, there is only a 3.5% probability of data loss.
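The 3.5% figure can be reproduced with a short sketch, assuming replication groups of size R = 3 and treating the disjoint groups as approximately independent (the function name is illustrative):

```python
from math import comb

def mincopysets_loss_probability(n, f, r=3):
    """Probability that at least one of the n // r disjoint replication groups
    has all r of its nodes among the f failed nodes (groups are treated as
    independent, which is a close approximation for large clusters)."""
    p_one_group = comb(f, r) / comb(n, r)  # a specific group is fully failed
    return 1 - (1 - p_one_group) ** (n // r)

# Example from the slide: 4000-node cluster, 120 nodes fail to reboot.
print(mincopysets_loss_probability(4000, 120))  # ~0.035, i.e. ~3.5%
```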

Extreme Failure Scenarios: Normal Clusters

Extreme Failure Scenarios: Big Clusters

MinCopysets’ Trade-off
MinCopysets trades off the frequency and the magnitude of data loss:
the expected amount of data lost stays the same, data loss occurs very rarely, but when it does occur its magnitude is greater.

Frequency vs. Magnitude of Failures
Setup: 5000-node HDFS cluster, 3 TB per machine, R = 3, one power outage per year.
Random replication: lose 5.5 GB every single year.
MinCopysets: lose data only once every 625 years, but lose an entire node's worth of data when it happens.
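A rough back-of-the-envelope check of these numbers, under assumed parameters the slide does not state (1% of nodes failing to reboot per outage, and the 3 TB per machine treated as raw replicated capacity); the outputs land in the same ballpark as the slide's figures rather than matching them exactly:

```python
from math import comb

N, R = 5000, 3
F = N // 100          # assumption: 1% of nodes fail to reboot per outage
TB_PER_NODE = 3       # assumption: raw (replicated) capacity per machine

# Probability that one specific set of R nodes is entirely within the failed set.
p_copyset = comb(F, R) / comb(N, R)

# Random replication: expected data lost per outage = unique data * p_copyset.
unique_data_gb = N * TB_PER_NODE / R * 1024
print(f"random replication: ~{unique_data_gb * p_copyset:.1f} GB lost per yearly outage")

# MinCopysets: data is lost only if some entire replication group fails.
p_any_group = 1 - (1 - p_copyset) ** (N // R)
print(f"MinCopysets: data loss roughly once every {1 / p_any_group:.0f} years")
```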

RAMCloud Implementation
The RAMCloud implementation was relatively straightforward, with two non-trivial issues:
Managing groups of nodes: chunks must be allocated on entire groups, and nodes joining and leaving groups must be handled.
Machine failures are more complex: an entire group must be re-replicated, rather than an individual node.

RAMCloud Implementation: Coordinator
(Diagram: the coordinator maintains a server list mapping each server ID to a replication group ID, e.g. server 0 → group 5. Backups are assigned to replication groups via an "Assign Replication Group" RPC; a master sends an "Open New Chunk" RPC to a backup and the reply includes the backup's replication group.)
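A minimal sketch of the coordinator-side bookkeeping described above; RAMCloud itself is written in C++, and this Python sketch with illustrative names only shows the idea of forming backups into fixed groups of R:

```python
R = 3  # replication factor / group size

class Coordinator:
    """Tracks which replication group each backup server belongs to."""

    def __init__(self):
        self.group_of = {}        # server id -> replication group id
        self.unassigned = []      # servers waiting for a full group
        self.next_group_id = 0

    def enlist_backup(self, server_id):
        """Called when a backup joins; once R unassigned backups exist,
        they are formed into a new replication group."""
        self.unassigned.append(server_id)
        if len(self.unassigned) == R:
            for s in self.unassigned:
                self.group_of[s] = self.next_group_id
            self.unassigned.clear()
            self.next_group_id += 1

coord = Coordinator()
for sid in range(7):
    coord.enlist_backup(sid)
print(coord.group_of)  # servers 0-5 assigned to groups 0 and 1; server 6 waits
```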

HDFS Implementation
Even simpler than RAMCloud: in HDFS, replication decisions are centralized on the NameNode, whereas in RAMCloud they are distributed.
The NameNode assigns DataNodes to replication groups.
Prototyped in about 200 lines of code.

HDFS Issues
HDFS has the same group-management issues as RAMCloud, plus:
Issue: repair bandwidth. Solution: hybrid scheme.
Issue: network bottlenecks and load balancing. Solution: kill the replication group and re-replicate its data elsewhere.
Issue: a replication group's capacity is limited by the node with the smallest capacity. Solution: form replication groups from nodes with similar capacities.

Facebook's HDFS Replication
Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss.
Facebook's algorithm:
The primary replica is placed on node j in rack k.
The secondary replicas are placed on randomly selected nodes among (j+1, ..., j+5), on racks (k+1, k+2).
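A minimal sketch of this constrained placement, with illustrative node/rack counts and modular arithmetic standing in for the node and rack ordering; it is a simplification of what Facebook's HDFS actually does:

```python
import random

NUM_NODES = 100   # nodes per logical ring (illustrative)
NUM_RACKS = 20    # racks per ring (illustrative)

def facebook_style_placement(j, k, r=3):
    """Primary replica on node j in rack k; the r-1 secondaries go on distinct
    randomly selected nodes among (j+1, ..., j+5), on racks (k+1, k+2)."""
    primary = (j % NUM_NODES, k % NUM_RACKS)
    node_offsets = random.sample(range(1, 6), r - 1)  # distinct offsets in 1..5
    secondaries = [((j + off) % NUM_NODES, (k + random.randint(1, 2)) % NUM_RACKS)
                   for off in node_offsets]
    return [primary] + secondaries

print(facebook_style_placement(j=7, k=3))
```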

Hybrid MinCopysets
Split nodes into replication groups of 2 and groups of 15.
The first and second replicas are always placed on the group of 2.
The third replica is randomly placed on the group of 15.
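A minimal sketch of the hybrid placement, with illustrative groups; how the groups of 2 and the groups of 15 are associated is not specified on the slide, so they are chosen independently here:

```python
import random

def hybrid_place(groups_of_2, groups_of_15):
    """Hybrid MinCopysets placement: the first and second replicas go on one
    group of 2; the third replica goes on a random node of a group of 15.
    (How the two kinds of groups are associated is not stated on the slide,
    so here they are chosen independently.)"""
    first, second = random.choice(groups_of_2)
    third = random.choice(random.choice(groups_of_15))
    return first, second, third

# Illustrative groups; a real system would build these from the cluster's node list.
groups_of_2 = [["node-0", "node-1"], ["node-2", "node-3"]]
groups_of_15 = [[f"node-{i}" for i in range(10, 25)]]
print(hybrid_place(groups_of_2, groups_of_15))
```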

Thank You!
Stanford University