Efficient Replica Maintenance for Distributed Storage Systems (USENIX NSDI '06)
Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris
MIT Computer Science and Artificial Intelligence Laboratory, Rice University/MPI-SWS, University of California, Berkeley
Speaker: Szu-Jui Wu

Definitions
Distributed storage system: a network file system whose storage nodes are dispersed over the Internet
Durability: objects that an application has put into the system are not lost due to disk failure
Availability: a get will be able to return the object promptly

Motivation
To store immutable objects durably at a low bandwidth cost in a distributed storage system

Contributions
– A set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data
– An implementation: Carbonite

Outline
Motivation
Understanding durability
Improving repair time
Reducing transient failure cost
Implementation issues
Conclusion

Understanding Durability

Providing Durability
Durability is relatively more important than availability
Challenges:
– Replication algorithm: create new replicas faster than they are lost
– Reducing network bandwidth
– Distinguishing transient failures from permanent disk failures
– Reintegration

Challenges to Durability
Create new replicas faster than replicas are destroyed
– Creation rate < failure rate → the system is infeasible
A higher number of replicas does not let the system survive a higher average failure rate
– Creation rate = failure rate + ε (ε small) → a burst of failures may still destroy all of the replicas

Number of Replicas as a Birth-Death Process
Assumption: independent, exponentially distributed inter-failure and inter-repair times
– λf: average failure rate
– μi: average repair rate in state i
– rL: lower bound on the number of replicas (rL = 3 in this example)
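
A minimal sketch (my own, not from the paper) of this birth-death model, assuming a constant repair rate μ that applies only while fewer than rL replicas exist; it estimates how often an object loses its last replica:

```python
import random

def simulate_object(lambda_f, mu, r_l, years, seed=0):
    """Birth-death chain over the replica count of one object (sketch).

    State i = number of live replicas. Each replica fails independently at
    rate lambda_f (per year), so state i fails at rate i * lambda_f. Repair
    creates one replica at rate mu while fewer than r_l copies exist.
    Returns the time at which the last replica is lost, or None if the
    object survives the simulated period.
    """
    rng = random.Random(seed)
    t, i = 0.0, r_l
    while t < years:
        if i == 0:
            return t                              # all replicas gone
        fail_rate = i * lambda_f
        repair_rate = mu if i < r_l else 0.0      # only repair below r_l
        t += rng.expovariate(fail_rate + repair_rate)
        if rng.random() < fail_rate / (fail_rate + repair_rate):
            i -= 1                                # a disk holding a replica failed
        else:
            i += 1                                # a repair completed
    return None

# Example: lambda_f = 0.45 failures/year, mu = 3 copies/year, r_l = 3
lost = sum(simulate_object(0.45, 3.0, 3, years=4, seed=s) is not None
           for s in range(1000))
print(f"objects lost over 4 simulated years: {lost} / 1000")
```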

Model Simplification
With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ
If Θ < 1, the system can no longer maintain full replication, regardless of rL

Real-World Settings
PlanetLab
– 490 nodes
– Average inter-failure time: 39.85 hours
– 150 KB/s bandwidth
Assumptions
– 500 GB per node
– rL = 3
λ = 365 / (490 × (39.85 / 24)) ≈ 0.45 disk failures / year
μ = 365 / (500 GB × 3 / 150 KB/s) ≈ 3 disk copies / year
Θ = μ/λ ≈ 7
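
The numeric values on this slide follow from straightforward arithmetic; a small script that reproduces them (inputs taken from the slide, constants are mine):

```python
SECONDS_PER_DAY = 86_400

nodes = 490
inter_failure_hours = 39.85      # testbed-wide mean time between failures
data_per_node_gb = 500
r_l = 3
bandwidth_kb_s = 150

# Per-node disk failure rate, in failures per year
lam = 365 / (nodes * (inter_failure_hours / 24))

# Time to re-create one node's worth of data, replicated r_l times,
# over a single 150 KB/s link; mu counts how many such copies fit in a year.
copy_seconds = data_per_node_gb * 1e6 * r_l / bandwidth_kb_s     # GB -> KB
mu = 365 / (copy_seconds / SECONDS_PER_DAY)

print(f"lambda ~ {lam:.2f} failures/year, mu ~ {mu:.1f} copies/year, "
      f"theta = mu/lambda ~ {mu / lam:.1f}")
```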

Impact of Θ
Θ is the theoretical upper limit on the number of replicas
bandwidth ↑ → μ ↑ → Θ ↑
rL ↑ → μ ↓ → Θ ↓
(The Θ = 1 line marks the feasibility threshold.)

Choosing rL
Guidelines:
– Large enough to ensure durability
– One more than the maximum burst of simultaneous failures
– Small enough to ensure rL ≤ Θ

rL vs. Durability
A higher rL costs more but tolerates larger failure bursts
Larger data size → λ ↑ → a higher rL is needed
Analytical results from PlanetLab traces (4 years)

Improving Repair Time

Definition: Scope
Each node n designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We call the size of that set the node's scope.
scope ∈ [rL, N]
– N: number of nodes in the system

Effect of Scope
Small scope
– Easy to keep track of objects
– More effort per node to create new replicas
Big scope
– Reduces repair time, and thus increases durability
– Must monitor many nodes
– Increases the likelihood of simultaneous failures

Scope vs. Repair Time
Scope ↑ → repair work is spread over more access links and completes faster
rL ↓ → scope must be higher to achieve the same durability
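
A rough illustration (assumed numbers, not from the paper) of why a larger scope shortens repair: a failed node's data is re-copied in parallel by roughly `scope` nodes, each over its own access link:

```python
def repair_time_hours(data_gb, scope, link_kb_s):
    """Crude parallel-repair estimate: data_gb of lost replicas are re-created
    by `scope` nodes in parallel, each pushing an equal share over its own
    link_kb_s access link. Ignores coordination overhead and skewed placement."""
    share_kb = data_gb * 1e6 / scope          # each node's share, in KB
    return share_kb / link_kb_s / 3600

for scope in (5, 25, 100, 490):
    print(f"scope = {scope:3d}: ~{repair_time_hours(500, scope, 150):6.1f} hours")
```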

Reducing Transient Costs

The Reasons
Avoid creating new replicas in response to transient failures
– Unnecessary costs (extra replicas)
– Wasted resources (bandwidth, disk)
Solutions
– Timeouts
– Reintegration
– Batch

Timeouts
Timeout > average down time
– Average down time: 29 hours
– Reduces maintenance cost
– Durability is still maintained
Timeout >> average down time
– Durability begins to fall
– Delays the point at which the system can begin repair
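
A sketch of the timeout rule, assuming the monitoring layer records when each replica holder was last heard from (names are illustrative):

```python
import time

DOWN_TIMEOUT_HOURS = 29      # e.g. on the order of the average down time

def presumed_live(replica_holders, now=None):
    """replica_holders maps node -> last time it was heard from (epoch seconds).
    A holder silent for less than the timeout is treated as a transient failure
    and its replica still counts; only after the timeout is the copy written off."""
    now = time.time() if now is None else now
    timeout_s = DOWN_TIMEOUT_HOURS * 3600
    return sum(1 for last in replica_holders.values() if now - last < timeout_s)

def needs_repair(replica_holders, r_l):
    """Trigger repair only when the presumed-live count drops below r_l."""
    return presumed_live(replica_holders) < r_l
```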

Reintegration
Reintegrate replicas stored on nodes that return after transient failures
The system must be able to track more than rL replicas
Effectiveness depends on a: the average fraction of time that a node is available

Effect of Node Availability
Pr[a new replica needs to be created] = Pr[fewer than rL replicas are available]
If each of n replicas is available independently with probability a, this is the binomial tail Σ_{k=0}^{rL−1} C(n,k) a^k (1−a)^{n−k}
Chernoff bound: 2rL/a replicas are needed to keep rL copies available
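
The binomial tail above is easy to evaluate directly; this small sketch (the parameters are just example values) shows how the probability of having fewer than rL available replicas falls as the number of maintained replicas approaches 2rL/a:

```python
from math import comb

def p_insufficient(n, a, r_l):
    """Pr[fewer than r_l of n replicas are up] when each replica is available
    independently with probability a (binomial tail)."""
    return sum(comb(n, k) * a**k * (1 - a)**(n - k) for k in range(r_l))

r_l, a = 3, 0.7
for n in (3, 4, 6, 9):       # 9 is roughly 2 * r_l / a
    print(f"n = {n}: Pr[fewer than {r_l} available] = {p_insufficient(n, a, r_l):.4f}")
```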

Node Availability vs. Reintegration
Reintegration can work safely with 2rL/a replicas
2/a is the penalty for not distinguishing transient from permanent failures

Four Replication Algorithms
Cates
– Fixed number of replicas rL
– Timeout
Total Recall
– Batch
Carbonite
– Timeout + reintegration
Oracle
– Hypothetical system that can distinguish transient failures from permanent failures
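
A sketch of the Carbonite-style rule (timeout plus reintegration); this is my own simplification, not the paper's implementation, and the monitoring/transfer callbacks are assumed to be supplied by the caller:

```python
def carbonite_step(tracked, r_l, is_reachable, timed_out, pick_new_holder, copy_object):
    """One maintenance round for a single object.

    tracked: set of nodes believed to hold a copy. Nodes that merely look
    down are kept in the set so their copies can be reintegrated when they
    return; a node is dropped only after it has exceeded the timeout.
    """
    available = {n for n in tracked if is_reachable(n)}
    expired = {n for n in tracked if not is_reachable(n) and timed_out(n)}
    tracked -= expired                        # give up only after the timeout

    while len(available) < r_l:               # repair back up to r_l copies
        new_holder = pick_new_holder(tracked)  # choose a scope node not already tracked
        copy_object(new_holder)
        tracked.add(new_holder)
        available.add(new_holder)
    return tracked
```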

Effect of Reintegration
(figure)

Batch
In addition to rL replicas, make e additional copies
– Makes repair less frequent
– Uses more resources
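
A one-function sketch of the batching knob e (not Total Recall's actual policy): repair starts only when fewer than rL copies are available, but then creates enough copies to reach rL + e at once:

```python
def batch_repair_count(available, r_l, e):
    """Number of new copies to create in a batched repair: none while at
    least r_l replicas are available, otherwise enough to reach r_l + e
    so the next few failures need no immediate repair."""
    return (r_l + e) - available if available < r_l else 0
```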

Implementation Issues

DHT vs. Directory-Based Storage Systems
DHT-based: place replicas by consistent hashing over an identifier space
Directory-based: use indirection to maintain data, with a DHT storing location pointers
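
A minimal sketch of the DHT-based flavor, assuming 160-bit SHA-1 identifiers and successor placement (the directory-based flavor would instead store a list of holder pointers under the key):

```python
import hashlib

def node_id(name):
    """160-bit identifier for a node name or object key."""
    return int.from_bytes(hashlib.sha1(str(name).encode()).digest(), "big")

def replica_set(key, node_names, r_l):
    """Consistent hashing: replicas of `key` live on the first r_l nodes
    whose IDs follow the key's ID around the ring."""
    ring = sorted((node_id(n), n) for n in node_names)
    k = node_id(key)
    start = next((i for i, (nid, _) in enumerate(ring) if nid >= k), 0)
    return [ring[(start + j) % len(ring)][1] for j in range(r_l)]

print(replica_set("object-42", [f"node{i}" for i in range(10)], r_l=3))
```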

Node Monitoring for Failure Detection
Carbonite requires that each node know the number of available replicas of each object for which it is responsible
The goal of monitoring is to allow nodes to track the number of available replicas

Monitoring in Consistent-Hashing Systems
– Each node maintains, for each object, a list of the nodes in its scope that do not hold a copy of the object
– When synchronizing, a node n provides key k to a node n′ that is missing the object with key k; this prevents n′ from reporting what n already knows

Monitoring Host Availability
The DHT's routing tables form a spanning tree rooted at each node with O(log N) out-degree
Each node periodically multicasts a heartbeat message to its children in the tree
When a heartbeat is missed, the monitoring node triggers repair actions
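
A toy illustration of the multicast structure (a complete k-ary tree over node indices; the real system derives the tree from its routing tables, so this only shows the O(log N) shape):

```python
import math

def children(index, n_nodes, fanout):
    """Children of position `index` in a complete k-ary tree over 0..n_nodes-1.
    Heartbeats flow from each node down to its children; a missed heartbeat
    at any child triggers repair for the data that child was responsible for."""
    first = fanout * index + 1
    return [c for c in range(first, first + fanout) if c < n_nodes]

n, fanout = 100, 4
print("root's children:", children(0, n, fanout))
print("tree depth ~", math.ceil(math.log(n, fanout)), "levels for", n, "nodes")
```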

Conclusion
Many design choices remain to be made
– Number of replicas (depends on the failure distribution, bandwidth, etc.)
– Scope size
– Response to transient failures:
  Reintegration (number of extra copies)
  Timeouts (timeout period)
  Batch (number of extra copies e)