Efficient replica maintenance for distributed storage systems Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris

Presentation transcript:

Efficient replica maintenance for distributed storage systems Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris

About the Paper: 137 citations (Google Scholar). In Proceedings of the 3rd Conference on Networked Systems Design & Implementation (NSDI '06), USENIX Association, Berkeley, CA, USA. Part of the research around the OceanStore project.

Credit: Modified version of MfDSS.ppt.

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion.

Related terms. Distributed storage systems: storage systems that aggregate the disks of many nodes scattered throughout the Internet.

Motivation: One of the most important goals of a distributed storage system is to be robust. The solution: replication. Durability: objects that an application has put into the system are not lost due to disk failure. Availability: a get will be able to return the object promptly.

Motivation: Failures come in two kinds. Transient failures (affect availability): loss of power, scheduled maintenance, network problems. Permanent failures (affect durability): disk failures.

Contribution: Develop and implement an algorithm, Carbonite, that stores immutable objects durably and at low bandwidth cost in a distributed storage system. Create replicas fast enough to handle failures. Keep track of all the replicas. Use a model to determine a reasonable number of replicas.

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion. This section: a model of the relationship between network capacity, the amount of replicated data, the number of replicas, and durability.

Providing Durability: Durability is more practical and useful than availability. Challenges to durability: creating new replicas faster than they are lost, reducing network bandwidth, distinguishing transient failures from permanent disk failures, and reintegrating replicas after transient failures.

Challenges to Durability: Create new replicas faster than replicas are destroyed. If the creation rate is below the failure rate, the system is infeasible: a higher number of replicas does not allow the system to survive a high average failure rate. If the creation rate only barely exceeds the failure rate (creation rate = failure rate + ε, with ε small), bursts of failures may still destroy all of the replicas.

Number of Replicas as a Birth-Death Process. Assumption: independent, exponentially distributed inter-failure and inter-repair times. λ_f: average failure rate. μ_i: average repair rate at state i. r_L: lower bound on the number of replicas, i.e., the target number of replicas needed to survive bursts of failures.
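To make the model concrete, here is a minimal Monte Carlo sketch (not from the paper) of this birth-death process: each live replica fails independently at rate λ_f, and a repair at rate μ runs whenever fewer than r_L copies remain. The single constant repair rate and the specific parameter values are simplifying assumptions for illustration.

```python
import random

def object_survives(lam, mu, r_l, years, initial=3):
    """Simulate one run of the replica birth-death process.
    lam  -- failure rate per replica (failures / replica / year)
    mu   -- repair rate while fewer than r_l replicas exist (copies / year)
    Returns True if the object never drops to zero replicas."""
    t, replicas = 0.0, initial
    while t < years:
        fail_rate = lam * replicas                     # any live replica may fail
        repair_rate = mu if replicas < r_l else 0.0    # repair only below target
        total = fail_rate + repair_rate
        t += random.expovariate(total)                 # time to the next event
        if t >= years:
            break
        if random.random() < fail_rate / total:
            replicas -= 1                              # lost a replica
            if replicas == 0:
                return False                           # object lost permanently
        else:
            replicas += 1                              # a repair finished
    return True

# Estimate 4-year durability with PlanetLab-like rates (see the example slide below).
runs = 10_000
alive = sum(object_survives(lam=0.45, mu=3.15, r_l=3, years=4) for _ in range(runs))
print(f"estimated durability: {alive / runs:.4f}")
```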

Model Simplification: With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ. Higher values of Θ decrease the time it takes to repair an object.

Example: PlanetLab, a large research testbed for computer networking and distributed systems research with nodes located around the world; different organizations each donate one or more computers. Parameters: 490 nodes, average inter-failure time of 39.85 hours, 150 KB/s of bandwidth per node. Assumptions: 500 GB per node, r_L = 3.
λ = 365 days / (490 × (39.85 / 24) days) ≈ 0.45 disk failures / year
μ = 365 days / (500 GB × 3 / 150 KB/s) ≈ 3 disk copies / year
Θ = μ/λ ≈ 7
λ depends on the number of nodes and the frequency of failures; μ depends on the amount of data and the available bandwidth. In general, Θ = μ/λ = constant × bandwidth × #nodes × inter-failure time / (amount of data × r_L).
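The following short calculation reproduces these estimates from the slide's formulas (a sketch; it uses 1 GB = 10^6 KB to match the slide's rough arithmetic):

```python
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_YEAR = 365 * 24 * 3600

nodes = 490                  # PlanetLab nodes
inter_failure_hours = 39.85  # testbed-wide average time between disk failures
bandwidth_kbps = 150         # per-node access link, KB/s
data_gb = 500                # data stored per node
r_l = 3                      # target replication level

# Per-disk failure rate: the testbed loses a disk every 39.85 hours,
# so an individual disk fails about once every nodes * 39.85 hours.
lam = HOURS_PER_YEAR / (nodes * inter_failure_hours)     # failures / disk / year

# Repair rate: time to copy r_L * 500 GB over a 150 KB/s link.
copy_seconds = data_gb * r_l * 1e6 / bandwidth_kbps      # GB -> KB
mu = SECONDS_PER_YEAR / copy_seconds                     # disk copies / year

theta = mu / lam
print(f"lambda ~ {lam:.2f}/yr, mu ~ {mu:.2f}/yr, theta ~ {theta:.1f}")
```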

Impact of Θ: Bandwidth ↑ → μ ↑ → Θ ↑; r_L ↑ → μ ↓ → Θ ↓. Θ is the theoretical upper limit on the number of replicas the system can sustain. If Θ < 1, the system can no longer maintain full replication, regardless of r_L. Θ = constant × bandwidth / r_L.

Choosing r_L, guidelines: large enough to ensure durability, i.e., at least one more than the maximum burst of simultaneous failures; but not larger than necessary, since if a low value of r_L would suffice, the bandwidth spent maintaining the extra replicas is wasted.

r_L vs. Durability: A higher r_L means higher cost but tolerates larger bursts of failures. A larger data size → λ ↑ → a higher r_L is needed. (Analytical results from 4 years of PlanetLab traces.)

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion. This section: proper placement of replicas on servers.

A Node's Scope. Definition: each node n designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We call the size of that set the node's scope. scope ∈ [r_L, N], where N is the total number of nodes in the system.

Effect of Scope. Small scope: easy to keep track of objects, but more time is needed to create new replicas. Big scope: reduces repair time and thus increases durability, but requires monitoring more nodes; with a large number of objects and random placement, it may also increase the likelihood that simultaneous failures destroy all replicas of some object.

Scope vs. Repair Time: scope ↑ → repair work is spread over more access links and completes faster; r_L ↓ → the scope must be higher to achieve the same durability.
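As a rough back-of-the-envelope sketch of the first point (my own simplification, not a formula from the paper): if a failed node's data is re-replicated in parallel by the other nodes in its scope, each pushing over its own access link, repair time shrinks roughly in proportion to the scope.

```python
def repair_time_days(data_gb, scope, link_kbps=150):
    """Rough parallel-repair estimate: data_gb gigabytes of lost data are
    re-copied by the (scope - 1) other nodes, each over its own access link."""
    total_kb = data_gb * 1e6
    seconds = total_kb / (link_kbps * max(scope - 1, 1))
    return seconds / 86400

for scope in (5, 25, 100, 490):
    print(f"scope={scope:3d}: ~{repair_time_days(500, scope):5.1f} days to re-replicate 500 GB")
```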

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion. This section: reducing the bandwidth wasted on transient failures, distinguishing transient from permanent failures, and reintegrating replicas stored on nodes that return after transient failures.

Reintegration: reintegrate replicas stored on nodes that return after transient failures. The system must be able to track all the replicas.

Effect of Node Availability. Let a be the average fraction of time that a node is available. Pr[a new replica needs to be created] = Pr[fewer than r_L replicas are available]. A Chernoff-bound argument shows that about 2r_L/a replicas are needed to keep r_L copies available.
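A quick numerical check of that rule of thumb (a sketch that treats each of n replicas as independently available with probability a, which is the model's assumption; the chosen values of r_L and a are illustrative):

```python
from math import comb

def prob_fewer_than(r_l, n, a):
    """P[fewer than r_l of n independent replicas are up], each up with prob. a."""
    return sum(comb(n, k) * a**k * (1 - a)**(n - k) for k in range(r_l))

r_l, a = 3, 0.7
n = round(2 * r_l / a)    # the 2*r_L/a rule of thumb -> about 9 replicas
print(f"n = {n}, P[fewer than {r_l} available] = {prob_fewer_than(r_l, n, a):.4f}")
```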

Node Availability vs. Reintegration: reintegration can work safely with 2r_L/a replicas; the factor 2/a is the penalty for not being able to distinguish transient from permanent failures.

Create replicas as needed: the probability of having to make new replicas depends on a. But how do we estimate a?

Timeouts: a heuristic to avoid misclassifying temporary failures as permanent. Setting a timeout can reduce the response to transient failures, but its success depends greatly on its relationship to the downtime distribution, and in some instances it can reduce durability as well.
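A minimal sketch of such a timeout heuristic (the class and method names are hypothetical, for illustration only; this is not the paper's implementation): a node that has been unreachable for longer than the timeout is treated as permanently failed, and its objects become candidates for repair.

```python
import time

class TimeoutFailureDetector:
    """Classify nodes as 'up', 'transient', or 'failed' from how long they
    have been unreachable. Illustrative only."""

    def __init__(self, timeout_seconds, probe_interval=60):
        self.timeout = timeout_seconds
        self.probe_interval = probe_interval
        self.last_seen = {}               # node id -> time of last successful probe

    def heard_from(self, node):
        self.last_seen[node] = time.time()

    def status(self, node):
        silent_for = time.time() - self.last_seen.get(node, float("-inf"))
        if silent_for < self.probe_interval:
            return "up"                   # answered a recent probe
        elif silent_for < self.timeout:
            return "transient"            # give it time to come back
        else:
            return "failed"               # assume the disk is gone; trigger repair
```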

Four Replication Algorithms compared: Cates (a fixed number of replicas r_L, with timeouts); Total Recall (batch mode: in addition to r_L replicas, make e additional copies so that repair can run less frequently); Carbonite (timeouts plus reintegration); and Oracle (a hypothetical system that can differentiate transient failures from permanent failures).
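The core of Carbonite's repair rule is simple: create a new copy only when fewer than r_L replicas are reachable, and count replicas on returning nodes again as soon as they come back. Below is a minimal sketch of one maintenance round; the helper functions reachable_replicas and create_replica are hypothetical stand-ins for the system's real monitoring and copy mechanisms.

```python
def maintain(object_id, r_l, scope_nodes, reachable_replicas, create_replica):
    """One maintenance round for a single object, in the spirit of Carbonite:
    repair only when fewer than r_l replicas are currently reachable. Replicas
    on nodes that return from transient failures are simply counted again
    (reintegration), so no extra copies are made for them."""
    live = set(reachable_replicas(object_id, scope_nodes))  # nodes holding a copy
    missing = r_l - len(live)
    for _ in range(max(missing, 0)):
        candidates = [n for n in scope_nodes if n not in live]
        if not candidates:
            break                          # nowhere left in scope to place a copy
        target = candidates[0]
        create_replica(object_id, target)
        live.add(target)
```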

Comparison (figure).

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion. This section: implementation challenges such as node monitoring.

Node Monitoring for Failure Detection: Carbonite requires that each node know the number of available replicas of each object for which it is responsible. The goal of monitoring is to let nodes track the number of available replicas. Challenges: monitoring in consistent-hashing systems, and monitoring host availability.
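A toy sketch of the bookkeeping that this monitoring goal implies (the data structure and names are assumptions for illustration, not the paper's design): keep, per object, the set of nodes believed to hold a copy, and adjust the available count as nodes go down and come back.

```python
from collections import defaultdict

class ReplicaTracker:
    """Per-object replica bookkeeping driven by node up/down notifications."""

    def __init__(self):
        self.holders = defaultdict(set)   # object id -> nodes storing a replica
        self.down = set()                 # nodes currently believed unreachable

    def record_replica(self, obj, node):
        self.holders[obj].add(node)

    def node_down(self, node):
        self.down.add(node)

    def node_up(self, node):
        self.down.discard(node)           # its replicas count again (reintegration)

    def available_replicas(self, obj):
        return len(self.holders[obj] - self.down)
```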

Outline: Motivation, Understanding durability, Improving repair time, Reducing transient costs, Implementation, Conclusion.

Conclusion: The paper describes a set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data. Carbonite keeps all data durable while using 44% more network traffic than a hypothetical system that responds only to permanent failures; in comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.

Questions or Comments? Thanks!