Persistence of Data in a Dynamic Unreliable Network


Persistence of Data in a Dynamic Unreliable Network
[Figure: a spread of participating nodes, from fastest/stable to slowest/flaky, each offering 10 .. 100s GB/node of idle cheap disk]
- Distributed data store w/ all the *ilities: high availability, good scalability, high reliability, maintainability, flexibility
- Reliable substrate
Presented by Rachel Rubin and Hakim Weatherspoon
CS294-4: Peer-to-Peer Systems

Outline
- Motivation/desires
  - Reliable distributed data store with dynamic members
  - Harness the aggregate power of the system
- Questions about the data store
  - How data is structured
  - How to access data
  - Amount of resources used to keep data durable: storage? bandwidth?
- Branching
- Cost of maintaining redundancy
- Optimized implementation
- Conclusion

The Data Object
[Figure: a version VGUID_i is the root of a B-tree of data blocks d1..d9 with indirect blocks; version VGUID_i+1 shares unchanged blocks via copy-on-write (new blocks d'8, d'9) and keeps a back pointer to VGUID_i]
- AGUID = hash{name + keys}
- GUID = cryptographically secure hash of a block's data; blocks are therefore immutable/read-only
- GUIDs allow any node to store (and any reader to verify) data
- The red arrow in the figure, the mutable mapping from the AGUID to the current VGUID, is the hard part
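A minimal sketch (not from the slides) of how the self-verifying naming above could work: each block's GUID is a secure hash of its contents, the AGUID is a hash of name plus keys, and a new version reuses the GUIDs of unchanged blocks (copy-on-write) while keeping a back pointer. Function names such as `make_version` are illustrative.

```python
import hashlib

def guid(data: bytes) -> str:
    # GUID = cryptographically secure hash of the block contents,
    # so any node can store the block and any reader can verify it.
    return hashlib.sha256(data).hexdigest()

def aguid(name: str, public_key: bytes) -> str:
    # AGUID = hash{name + keys}: a stable name for the mutable object.
    return hashlib.sha256(name.encode() + public_key).hexdigest()

def make_version(block_guids: list[str], prev_vguid: str | None) -> str:
    # A version (VGUID) is just the GUID of a root block listing child GUIDs
    # plus a back pointer to the previous version.
    root = ("|".join(block_guids) + "|prev:" + str(prev_vguid)).encode()
    return guid(root)

# Copy-on-write update: only changed blocks get new GUIDs.
blocks_v1 = [guid(b"d1"), guid(b"d2"), guid(b"d8")]
v1 = make_version(blocks_v1, None)
blocks_v2 = [blocks_v1[0], blocks_v1[1], guid(b"d'8")]  # d1, d2 reused unchanged
v2 = make_version(blocks_v2, v1)                        # back pointer to v1
```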

Mutable Data
- A real system needs mutable data
- An entity in the network holds the AGUID-to-VGUID mapping
  - Point of serialization for integrity
  - Verifies client privileges
  - Atomically applies updates
- Versioning system: each version is inherently read-only
- End result: complex objects with mutability as a trail of versions, tied to a single AGUID
- The pointer to the head of the data is tricky; this limitation motivates heartbeats (a signed map from AGUID to the current VGUID)
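A hedged sketch of the serializer role described above: it checks a writer's privilege, atomically advances the AGUID-to-VGUID head pointer, and emits a signed heartbeat mapping. The signing scheme (HMAC here) and all names are placeholders, not the presenters' implementation.

```python
import hashlib
import hmac
import threading
import time

class Serializer:
    """Point of serialization: maps each AGUID to its current head VGUID."""
    def __init__(self, signing_key: bytes, writers: set[str]):
        self._key = signing_key
        self._writers = writers          # clients allowed to update
        self._head: dict[str, str] = {}  # AGUID -> current VGUID
        self._lock = threading.Lock()

    def apply_update(self, client: str, aguid: str, new_vguid: str) -> None:
        if client not in self._writers:
            raise PermissionError("client lacks write privilege")
        with self._lock:                 # atomically advance the head pointer
            self._head[aguid] = new_vguid

    def heartbeat(self, aguid: str) -> tuple[str, str, float, str]:
        """Signed (AGUID, VGUID, timestamp) tuple that readers can verify."""
        vguid = self._head[aguid]
        ts = time.time()
        msg = f"{aguid}|{vguid}|{ts}".encode()
        sig = hmac.new(self._key, msg, hashlib.sha256).hexdigest()
        return aguid, vguid, ts, sig
```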

Branching and Versioning
- Modifying old versions of the data
  - Possible conflict when merging a modified old version with the current head
- Branching
  - Provides different data threads
  - Makes time-travel more functional
- Multiple data threads are operational: defer conflicts rather than aborting updates
- Supports disconnected operation

Overview of Macrobranching
- Each branch is treated as its own object
- Branches are created from the main object

Macrobranching Story
[Figure: timeline showing branches AGUID2, AGUID3, and AGUID4 created from AGUID1 over time]

Macrobranching Details
- Writing
  - Create the branch in the serializer
  - Mark the new branch creation in the main branch's metadata
  - Mark in the new branch which object and version it was created from
  - The new AGUID needs to be managed
- Reading
  - From the new AGUID
- Close
  - Can no longer write to the branch
  - Merge with the main branch if specified
- Recovery
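A hedged sketch of the branch-creation bookkeeping listed above: the main branch records that a branch was created, and the branch records which object and version it forked from; closing a branch stops further writes. Field names are illustrative, not from the slides.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class BranchRecord:
    parent_aguid: str      # which object the branch was created from
    parent_vguid: str      # which version it was created from
    closed: bool = False   # once closed, no further writes are accepted

@dataclass
class MainBranch:
    aguid: str
    head_vguid: str
    branches: dict[str, BranchRecord] = field(default_factory=dict)

    def create_branch(self, at_vguid: str) -> str:
        # New AGUID for the branch: it is managed as its own object.
        branch_aguid = hashlib.sha256(
            f"{self.aguid}|branch@{at_vguid}".encode()).hexdigest()
        # Mark the creation in the main branch's metadata ...
        self.branches[branch_aguid] = BranchRecord(self.aguid, at_vguid)
        # ... while the BranchRecord itself records the branch's origin.
        return branch_aguid

    def close_branch(self, branch_aguid: str, merge: bool = False) -> None:
        rec = self.branches[branch_aguid]
        rec.closed = True              # can no longer write to the branch
        if merge:
            # Merging with the main head is application-specific; omitted here.
            pass
```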

Application: NFS w/ Branching and Time Travel
- Access data from a point in the past and modify it from there
- Directories can be rolled back and modified
- Modifications do not automatically become the main branch head
- Organizationally cleaner

Dynamic Access
- Access data reliably in a dynamic network
- How much does this cost?

DHT Advantages
- Spread the storage burden evenly; avoid hot spots
- Tolerate unreliable participants
- O(log N) algorithm; simple
- The DHT automatically and autonomously maintains data: it decides who, what, when, where, why, and how

Basic Assumptions
- P2P purist ideals: cooperation, symmetry, decentralization
- DHT assumptions
  - Simple redundancy maintenance mechanisms on node enter and exit
  - Static data placement strategy (f: RB → N)
  - Identical per-node space and bandwidth contributions
  - Constant rate of entering and exiting; independence of exit events
  - Constant steady-state number of nodes and total data size
- Maintenance bandwidth: average-case analysis

Basic Cost Model
- N: number of hosts
- D: data
- S: data + redundancy (S = kD)
- λ: entering rate; μ: exiting rate (λ = μ)
- T: lifetime (T = N/λ)
- B: bandwidth
- τ: membership timeout
  - Distinguishes true departures from temporary downtime; delays the response to failures
- a: availability — hosts serve data only a fraction of the time
  - More redundancy is needed
  - Effective bandwidth is reduced
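As a reasoning aid (not text from the slides), the steady-state argument behind these parameters can be written out; this follows the standard analysis in the "pick two" line of work, using the slide's notation:

```latex
% Hosts exit at rate \mu = N/T. Each exiting host takes S/N bytes of stored
% redundancy with it, which the surviving hosts must recreate, so the total
% repair rate is \mu \cdot S/N = S/T. Spread over N hosts, the per-node
% maintenance bandwidth must satisfy
\[
  B \;\ge\; \frac{S}{N\,T} \;=\; \frac{k\,D}{N\,T},
  \qquad\text{equivalently}\qquad
  \frac{S}{N} \;\le\; B\,T .
\]
```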

BW for Redundancy Maintenance
- Maintenance BW = 200 Kbps
- Lifetime = median 2001 Gnutella session = 1 hour
- Served space = 90 MB/node << donatable storage!
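Plugging the slide's numbers into the bound S/N ≤ B·T gives the 90 MB figure directly (a quick check; 1 Kb = 1000 bits):

```python
# served space per node  <=  maintenance bandwidth * membership lifetime
bandwidth_bps = 200_000          # 200 Kbps of maintenance bandwidth
lifetime_s = 3600                # median 2001 Gnutella session: 1 hour
served_bytes = bandwidth_bps * lifetime_s / 8
print(served_bytes / 1e6, "MB per node")   # -> 90.0 MB, far below donatable disk
```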

Need Too Much BW to Maintain Redundancy
[Figure: triangle of High Availability, Scalable Storage, and Dynamic Membership — must pick two]
- Wait! It gets worse… HW trends

Hardware Trends
- The picture only gets worse
- Participation must become more stable for nodes to contribute a meaningful fraction of their disks

Solution → Indirection
- Distributed directory (DD)
  - Uses a level of indirection
  - Decouples the networking layer from the data layer
  - Controls data placement
  - Exploits heterogeneity (availability, lifetime, and bandwidth)
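A hedged sketch of the indirection idea: the DHT layer stores small pointers mapping an object's GUID to the (preferably reliable) nodes that actually hold the data, instead of storing the data in the DHT itself. The class and method names below are illustrative placeholders, not the presenters' design.

```python
from collections import defaultdict

class DistributedDirectory:
    """DHT layer stores pointers only; bulk data lives on chosen (reliable) nodes."""
    def __init__(self):
        self.pointers: dict[str, set[str]] = defaultdict(set)  # GUID -> node addresses

    def publish(self, guid: str, node: str) -> None:
        # The data layer controls placement (e.g. prefer high-availability nodes),
        # then registers a pointer; pointers are cheap, so they can be replicated widely.
        self.pointers[guid].add(node)

    def locate(self, guid: str) -> set[str]:
        # Readers resolve the GUID to the nodes holding the data, then fetch directly.
        return self.pointers[guid]

    def unpublish(self, guid: str, node: str) -> None:
        # Must be paired with heartbeats, or this leaves dangling pointers / memory leaks.
        self.pointers[guid].discard(node)
```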

Models for Comparison I: DHT vs. DD
- Data extracted from [Bhagwan, Savage, and Voelker 2003]

Models for Comparison II: DHT vs. DD
- Reliable nodes: greater than 70% availability
- Model 1: DHT
- Model 2
  - Reliable nodes (Model 2.a): store data and DD pointers
  - Unreliable nodes (Model 2.b): store DD pointers only
- Model 3
  - Reliable nodes (Model 3.a): store all data and DD pointers
  - Unreliable nodes (Model 3.b): do nothing (i.e., free loaders)

BW/N vs. Lifetime
[Figure: plot of per-node maintenance bandwidth (BW/N) versus node lifetime]

BW/N vs. Data/Ptr
[Figure: plot of per-node maintenance bandwidth (BW/N) versus the data-to-pointer size ratio]

Replication vs. Coding
[Figure: plot comparing replication with coding]

Problems with DD I
- Ratio of data to pointer size
  - Need D/P > k_p·m
- Memory leaks
  - No pointer to the data
  - Solved with redundancy in pointers
- Dangling pointers
  - Node is dead, or node removed the data but not the pointer
  - Solved with heartbeats

Problems with DD II
- Heartbeats: freshness/accuracy vs. bandwidth
- Routing using pointers: infrastructure vs. application
- Complexity: need to decide who, what, where, when, why, and how to maintain redundancy

Efficient Heartbeats
- Exploit the locality properties of Pastry/Tapestry
- Efficient detection, but slow
- O(N) host knowledge if objects per node > N
[Figure: nodes 000–111 grouped by shared ID prefixes (0**, 1**, 00*, 01*, 10*, 11*) in a Plaxton-style prefix tree]
Speaker notes: Global sweep and repair is not efficient; we want detection of node removal in the system, then efficient detection and reconstruction of fragments, but detection itself is not efficient. The system should automatically adapt to failure, repair itself, and incorporate new elements. Can we guarantee data is available for 1000 years? New servers are added from time to time, old servers are removed from time to time, and everything just works: many components with geographic separation, so the system is not disabled by natural disasters, can adapt to changes in demand and regional outages, and gains stability through statistics.
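A hedged sketch of prefix-local heartbeat monitoring in the spirit of the Plaxton/Tapestry locality argument above: each node pings only peers that share a long ID prefix with it, so detection load is spread across prefix groups instead of requiring a global sweep. This is illustrative, not the presenters' mechanism.

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def heartbeat_targets(self_id: str, all_ids: list[str], min_prefix: int) -> list[str]:
    # Only monitor peers whose IDs share at least `min_prefix` digits with ours;
    # longer shared prefixes tend to mean network-local peers in Pastry/Tapestry.
    return [nid for nid in all_ids
            if nid != self_id and shared_prefix_len(self_id, nid) >= min_prefix]

# Example with 3-bit IDs: node 010 pings only the 01* group rather than every host.
ids = [format(i, "03b") for i in range(8)]
print(heartbeat_targets("010", ids, min_prefix=2))   # -> ['011']
```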

When Repair Triggers
- The root knows the redundancy level
- A threshold triggers repair
- Routing to the object: needs infrastructure
[Figure: Tapestry-style routing mesh (hops L1–L4) from a client through intermediate nodes to the object's root, which tracks Fragment-1 and Fragment-2]
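A hedged sketch of threshold-triggered repair as described above: the root tracks how many fragments of an object are still alive (e.g. via heartbeats) and regenerates fragments when the count drops below a repair threshold. Class and parameter names are illustrative.

```python
class ObjectRoot:
    """Root node that tracks the redundancy level of one object's fragments."""
    def __init__(self, fragment_holders: set[str], repair_threshold: int):
        self.holders = set(fragment_holders)   # nodes believed to hold a fragment
        self.threshold = repair_threshold      # minimum live fragments before repair

    def on_missed_heartbeat(self, node: str) -> None:
        self.holders.discard(node)
        if len(self.holders) < self.threshold:
            self.trigger_repair()

    def trigger_repair(self) -> None:
        # Reconstruct missing fragments from the surviving ones and place them
        # on new nodes (placement policy omitted in this sketch).
        print(f"repair: only {len(self.holders)} fragments left, regenerating")

root = ObjectRoot({"9598", "4598"}, repair_threshold=2)
root.on_missed_heartbeat("4598")   # drops below the threshold -> repair triggered
```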

Conclusions
- Immutable data assists in secure read-only data and caching infrastructures, and in continuous adaptation and repair
- DHTs do NOT consider the suitability of a peer for a specific task before delegating the task to it
- Differentiating between reliable and unreliable nodes saves bandwidth; savings increase as the gap widens (e.g., the reliability gap)
- A distributed directory utilizes reliable nodes
  - Need Data/Ptr > 10,000
  - Must prevent memory leaks with pointer redundancy and dangling pointers with heartbeats
  - Heartbeats require O(N) host knowledge