Copysets: Reducing the Frequency of Data Loss in Cloud Storage

Presentation transcript:

Copysets: Reducing the Frequency of Data Loss in Cloud Storage. Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum, Stanford University. Hi everyone, my name is Asaf Cidon from Stanford University. Today I’m going to talk about techniques that control the frequency of data loss in cloud storage systems. This is joint work with Steve Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum.

Goal: Tolerate Node Failures. Random replication is used by HDFS, GFS, Windows Azure, RAMCloud and others. Cloud storage systems typically spray their data across thousands of commodity servers. When you have thousands of nodes, there is a high likelihood of node failures, so one of the main goals of these systems is to tolerate them. The common approach taken by these systems is to replicate each data chunk on 3 random servers on different racks. This technique is used by most cloud storage systems, prominently by Hadoop, Google, Windows Azure and RAMCloud. If you assume failures are independent, the chance of losing all three copies of a chunk is nearly zero.
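As a rough illustration of what "choose 3 random servers" means, here is a minimal Python sketch. It is not the actual HDFS or GFS placement code (which also accounts for racks and load); the names random_replication, nodes and r are made up for this example.

import random

def random_replication(chunk_id, nodes, r=3):
    # Place the r copies of a chunk on r distinct, randomly chosen nodes.
    # Real systems add rack-awareness and load balancing; this sketch ignores them.
    return random.sample(nodes, r)

nodes = list(range(1, 10))   # a toy 9-node cluster
placement = {chunk: random_replication(chunk, nodes) for chunk in range(100)}

With enough chunks, almost every combination of 3 nodes ends up holding all copies of some chunk, which is exactly the problem the talk explores next.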

Not All Failures are Independent. Power outages happen 1-2 times a year [Google, LinkedIn, Yahoo], and large-scale network failures 5-10 times a year [Google, LinkedIn]. And more: rolling software/hardware upgrades and planned power-downs. Unfortunately, not all node failures are independent. There are frequent correlated failures, where multiple nodes fail at the same time. These can be caused by cluster power outages, where the entire cluster loses power and a small percentage of nodes (usually about 1%) fails to reboot properly.

Random Replication Fails Under Simultaneous Failures. In this talk we’ll focus on one particular failure mode, namely power outages where some percentage of machines fails to reboot. The graph on this slide shows the probability of losing all copies of at least one chunk on the y axis, as a function of the number of nodes in the cluster on the x axis, when 1% of the nodes fail at the same time. As you can see, the probability of losing data in this scenario is very high; this effect has been documented by several system operators, such as Yahoo, LinkedIn and Facebook. For almost any three nodes that fail together, there are probably some blocks stored in common. Confirmed by: Facebook, Yahoo, LinkedIn.

Random Replication. To understand why this happens, let’s look at the following example. Assume we have a cluster of 9 nodes, Node 1 through Node 9.

Random Replication. Node 1 randomly replicates one of its chunks on nodes 5 and 6.

Random Replication. From the perspective of this single chunk, we will only lose data if nodes 1, 5 and 6 fail at the same time.

Random Replication. Now let’s add another chunk, from node 2, that is randomly replicated on nodes 6 and 8.

Random Replication. Now we will only lose data if we either lose the combination of nodes 1, 5 and 6, or the combination of nodes 2, 6 and 8.

Random Replication. It’s time to introduce a key concept of our paper, called a copyset. A copyset is a set of nodes that contains all copies of a single chunk. For example, nodes 1, 5 and 6 form a copyset. A copyset is in essence a unit of failure, because if all nodes of a copyset fail at the same time, we will lose data. Copysets so far: {1, 5, 6}, {2, 6, 8}.
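To make the definition concrete, here is a small sketch (illustrative names, not code from any of these systems) that derives the copysets from a chunk-to-nodes placement map like the one built above:

def copysets(placement):
    # placement maps each chunk id to the list of nodes holding its copies;
    # every distinct set of nodes that holds all copies of some chunk is a copyset.
    return {frozenset(copy_nodes) for copy_nodes in placement.values()}

example = {"chunk_a": [1, 5, 6], "chunk_b": [2, 6, 8]}
print(copysets(example))   # {frozenset({1, 5, 6}), frozenset({2, 6, 8})}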

Random Replication. As more chunks are randomly replicated, the copysets keep accumulating: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 2, 7}, {1, 2, 8}, …

Random Replication Causes Frequent Data Loss. With enough chunks, random replication eventually creates the maximum number of copysets: every combination of 3 out of the 9 nodes, which is 84 copysets. If any 3 nodes fail simultaneously, there is a 100% probability of data loss.
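That claim follows directly from the copyset count: data is lost exactly when the failed nodes cover some copyset. A minimal sketch of the calculation for the toy 9-node cluster (loss_probability is a made-up helper name):

from itertools import combinations

def loss_probability(copyset_list, num_nodes, failures=3):
    # Fraction of all possible simultaneous 'failures'-node combinations that
    # contain at least one complete copyset, i.e. destroy all copies of some chunk.
    combos = list(combinations(range(1, num_nodes + 1), failures))
    lost = sum(1 for failed in combos
               if any(cs <= set(failed) for cs in copyset_list))
    return lost / len(combos)

# Random replication eventually turns every 3-node set into a copyset.
all_copysets = [set(c) for c in combinations(range(1, 10), 3)]
print(len(all_copysets), loss_probability(all_copysets, 9))   # 84 1.0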

MinCopysets. The idea behind MinCopysets is to derandomize replication: the nodes are statically partitioned into fixed replication groups, and every chunk is replicated on all the members of a single group. With 9 nodes and 3 replicas, this yields only three copysets: {1, 5, 7}, {2, 4, 9}, {3, 6, 8}.

MinCopysets Minimizes Data Loss Frequency. MinCopysets creates the minimum number of copysets: only {1, 5, 7}, {2, 4, 9} and {3, 6, 8}. If 3 nodes fail simultaneously, the probability of data loss is only about 3.5% (3 copysets out of the 84 possible 3-node combinations).
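A minimal sketch of MinCopysets-style placement, assuming a simple random partition of the nodes into groups of size r (make_groups and mincopysets_place are illustrative names, not the actual RAMCloud or HDFS implementation):

import math
import random

def make_groups(nodes, r=3):
    # Statically partition the nodes into disjoint replication groups of size r.
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + r] for i in range(0, len(nodes), r)]

def mincopysets_place(chunk_id, groups):
    # Every chunk is replicated on all members of one group, so the groups
    # themselves are the only copysets that can ever exist.
    return random.choice(groups)

groups = make_groups(range(1, 10))            # e.g. [[1, 5, 7], [2, 4, 9], [3, 6, 8]]
print(mincopysets_place("chunk_a", groups))   # all three replicas land in one group
print(len(groups) / math.comb(9, 3))          # 3 copysets / 84 combinations ~ 0.036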

MinCopysets Reduces Probability of Data Loss. In terms of reducing the probability of data loss, this more or less eliminates the problem. But it does have two major disadvantages.

The Trade-off. For a 5000-node cluster where a power outage occurs every year: MinCopysets gives a mean time to failure of 625 years but loses about 1 TB per data loss event, while random replication gives a mean time to failure of 1 year but loses only about 5.5 GB per event. Many system designers prefer the trade-off of rare but larger losses, including Facebook and LinkedIn. Confirmed by: Facebook, LinkedIn, NetApp, Google.

Problem: MinCopysets Increases Single Node Recovery Time. With random replication, since you sprayed your data across a whole bunch of nodes, you can reconstitute a failed node very quickly by reading from many nodes in parallel. With MinCopysets, you more or less have to copy all the data from just two nodes.

Facebook Extension to HDFS. Many HDFS operators have noticed this and worked around it. Here’s an example: Facebook’s extension to HDFS confines each node’s replicas to a small buddy group of nodes, and chooses the replica nodes randomly within that group.

A Compromise: the Facebook extension to HDFS.

Can We Do Better? That is, can we do better than the Facebook extension to HDFS?

Definition: Scatter Width. A node’s scatter width is the number of other nodes that store copies of its data. MinCopysets has a scatter width of 2, while the Facebook extension to HDFS has a scatter width of 10.
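A hedged sketch of how scatter width relates to copysets (scatter_width is an illustrative helper, not terminology from the systems themselves): a node’s scatter width is the number of distinct other nodes appearing in the copysets it belongs to, and since each copyset contributes at most R-1 of them, reaching scatter width S takes at least ceil(S / (R-1)) copysets per node.

import math

def scatter_width(node, copyset_list):
    # Number of distinct other nodes that hold copies of this node's chunks:
    # the union of the node's copysets, minus the node itself.
    others = set()
    for cs in copyset_list:
        if node in cs:
            others |= cs - {node}
    return len(others)

mincopysets = [{1, 5, 7}, {2, 4, 9}, {3, 6, 8}]
print(scatter_width(1, mincopysets))    # 2

S, R = 10, 3
print(math.ceil(S / (R - 1)))           # at least 5 copysets per node for S = 10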

Facebook Extension to HDFS. In the 9-node example, each node replicates its chunks within a sliding window of nodes, its buddy group. Node 1’s copysets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}. Overall: 54 copysets. If 3 nodes fail simultaneously, 54 of the 84 possible 3-node combinations are still copysets, so the probability of data loss is still about 64%.
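A sketch that reproduces those counts, assuming each node’s buddy group is the window of the 4 nodes that follow it with wrap-around (the window size and the wrap-around are assumptions chosen to match the slide’s numbers, not Facebook’s exact policy):

from itertools import combinations

def buddy_copysets(num_nodes=9, window=4, r=3):
    # Each node may place the other r-1 replicas of its chunks on any of the
    # 'window' nodes that follow it, so its copysets are the (r-1)-subsets of
    # that window combined with the node itself.
    found = set()
    for node in range(1, num_nodes + 1):
        buddies = [(node + k - 1) % num_nodes + 1 for k in range(1, window + 1)]
        for picks in combinations(buddies, r - 1):
            found.add(frozenset((node,) + picks))
    return found

cs = buddy_copysets()
print(len(cs), len(cs) / 84)   # 54 copysets, ~0.64 probability if 3 of 9 nodes fail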

Copyset Replication: Intuition. With the same scatter width (4), a different scheme creates far fewer copysets: {1, 2, 3}, {4, 5, 6}, {7, 8, 9} and {1, 4, 7}, {2, 5, 8}, {3, 6, 9}. The ingredients of an ideal scheme are to maximize the scatter width while minimizing the overlap between copysets. If you carefully place the chunks so that they create a smaller number of copysets, you get much better reliability for the same scatter width. This paper explores the notion that, rather than randomly spraying data, you can place it so that you get both high reliability and fast recovery.

Copyset Replication: Initialization. It’s a very complex problem to solve this optimally. In the paper we present a heuristic scheme, which we call Copyset Replication. At initialization time we take a random permutation of the nodes, for example 7 3 5 6 2 9 1 8 4, and split it into consecutive copysets: {7, 3, 5}, {6, 2, 9}, {1, 8, 4}. A single permutation gives every node a scatter width of 2.

Copyset Replication: Initialization. To increase the scatter width, we add more permutations. Permutation 1 (7 3 5 6 2 9 1 8 4) gives a scatter width of 2, adding permutation 2 (9 7 1 5 6 8 4 2 3) raises it to 4, and so on, until permutation 5 gives a scatter width of 10.

Copyset Replication: Replication. When a chunk is replicated, the primary replica is placed on a node as usual, then one of the precomputed copysets containing that node is chosen at random, and the remaining replicas are placed on the other members of that copyset.
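A minimal sketch of both phases, assuming R = 3 and a node count divisible by R (build_copysets and place_chunk are illustrative names; this is a simplification of the scheme described in the talk, not the HDFS or RAMCloud implementation):

import math
import random

def build_copysets(nodes, r=3, scatter_width=4):
    # Initialization phase: generate ceil(S / (r-1)) random permutations and
    # chop each permutation into consecutive groups of r nodes; every group
    # becomes a copyset.
    num_permutations = math.ceil(scatter_width / (r - 1))
    copyset_list = []
    for _ in range(num_permutations):
        perm = list(nodes)
        random.shuffle(perm)
        copyset_list += [perm[i:i + r] for i in range(0, len(perm), r)]
    return copyset_list

def place_chunk(primary, copyset_list):
    # Replication phase: keep the primary replica where it is and place the
    # remaining replicas on a randomly chosen copyset containing the primary.
    candidates = [cs for cs in copyset_list if primary in cs]
    return random.choice(candidates)

copyset_list = build_copysets(range(1, 10))
print(place_chunk(1, copyset_list))   # e.g. [7, 1, 8]

Because the only copysets that ever exist are the ones created at initialization, the number of copysets stays close to the minimum needed for the chosen scatter width.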

Insignificant Overhead

Copyset Replication

Inherent Trade-off

Related Work. BIBD (Balanced Incomplete Block Designs), originally proposed for designing agricultural experiments in the 1930s! [Fisher, '40] Other applications: power downs [Harnik et al. '09, Leverich et al. '10, Thereska '11] and multi-fabric interconnects [Mehra, '99].

Summary. Many storage systems randomly spray their data across a large number of nodes, which is a serious problem under correlated failures. Copyset Replication is a better way of spraying data that decreases the probability of data loss when correlated failures occur.

Thank You! Stanford University. Intro note: we initially designed a replication system for RAMCloud called MinCopysets and gave a talk on it. We saw that it doesn’t translate well to disk-based systems because it impacts their node recovery, so we then designed Copyset Replication.

More Failures (Facebook)

RAMCloud

HDFS