OceanStore: An Architecture for Global-Scale Persistent Storage

Introduction Vision: ubiquitous computing devices Goal: transparency Where to store persistent information? How to protect against system failures? How to upgrade components without losing configuration info? How to manage consistency?

Introduction Requirements Intermittent connectivity Secure from theft and denial-of-service Durable information Automatic and reliable archival services Information divorced from location Geographically distributed servers Caching close to clients Information can migrate to wherever it is needed Scale: 10^10 users, each with 10,000 files

OceanStore: A True Data Utility Utility model: consumers pay a monthly fee in exchange for access to persistent storage Highly available data from anywhere Automatic replication for disaster recovery Strong security Providers would buy and sell capacity among themselves for mobile users Deep archival storage: use excess storage space to ease data management

Two Unique Goals Use untrusted infrastructure May crash without warning Encrypted information in the infrastructure A responsible party is financially liable for the integrity of the data Support nomadic data Data can be cached anywhere, anytime Continuous introspective monitoring to manage caching and locality

System Overview The fundamental unit in OceanStore: a persistent object Named by a globally unique identifier (GUID) Replicated and stored on multiple servers Independent of the server (floating replicas) Two mechanisms to locate a replica: probabilistically probe neighboring machines, then fall back to a slower deterministic algorithm

OceanStore Updates Each update (or group of updates) to an object creates a new version Consistency is based on versioning No need for backup Pointers are permanent

OceanStore Objects An active object is the latest version of its data An archival object is a permanent, read-only version of the object Encoded with an erasure code Any m out of n fragments can reconstruct the original data Can support either weak or strong consistency models
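
To make "any m out of n fragments can reconstruct the original data" concrete, here is a minimal sketch in the spirit of Reed-Solomon coding: m data words become the coefficients of a polynomial over a prime field, each fragment is one evaluation of that polynomial, and any m fragments recover the coefficients by interpolation. The field size, word encoding, and parameters are illustrative assumptions, not the codes OceanStore actually used.

# Sketch of m-of-n erasure coding via polynomial evaluation/interpolation
# over a prime field (Reed-Solomon style).  Illustration only.
P = 2_147_483_647          # a prime larger than any data word (2^31 - 1)

def encode(words, n):
    """Treat the m data words as polynomial coefficients and evaluate at
    n distinct points; each (x, y) pair is one fragment."""
    def poly(x):
        y = 0
        for c in reversed(words):              # Horner's rule
            y = (y * x + c) % P
        return y
    return [(x, poly(x)) for x in range(1, n + 1)]

def decode(fragments, m):
    """Recover the m original words from any m fragments by Lagrange
    interpolation of the polynomial's coefficients."""
    frags = fragments[:m]
    coeffs = [0] * m
    for i, (xi, yi) in enumerate(frags):
        basis, denom = [1], 1                  # build the i-th Lagrange basis polynomial
        for j, (xj, _) in enumerate(frags):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)       # multiply basis by (x - xj)
            for k, c in enumerate(basis):
                new[k + 1] = (new[k + 1] + c) % P
                new[k] = (new[k] - c * xj) % P
            basis = new
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P     # scale the basis by yi / denom
        for k, c in enumerate(basis):
            coeffs[k] = (coeffs[k] + c * scale) % P
    return coeffs

words = [104, 105, 33, 42]                     # m = 4 data words
frags = encode(words, n=8)                     # n = 8 fragments; any 4 suffice
assert decode(frags[3:7], m=4) == words        # reconstruct from fragments 4..7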

Applications Groupware: calendar, email, contact lists, distributed design tools Allow concurrent updates Provide ways to merge information and detect conflicts

Applications Digital libraries Require massive quantities of storage Replication for durability and availability Deep archival storage to survive disaster Seamless migration of data to where it is needed Sensor data aggregation and dissemination

Naming GUID: pseudo-random fixed-length bit string Naming facility Decentralized Self-certifying path names GUID = hash(user key, file name) Multiple roots in OceanStore GUID of a server is a secure hash of its key GUID of a data fragment is a secure hash of the data content
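
As an illustration of the self-certifying naming on this slide, the sketch below derives GUIDs by hashing; SHA-1 and the exact way the inputs are combined are assumptions made for the example, not a specification of the system.

# Sketch of self-certifying GUIDs: an object GUID hashes the owner's key
# together with the name, and a fragment GUID hashes the fragment's content.
import hashlib

def object_guid(owner_key: bytes, name: str) -> str:
    return hashlib.sha1(owner_key + name.encode()).hexdigest()

def fragment_guid(fragment_data: bytes) -> str:
    return hashlib.sha1(fragment_data).hexdigest()

guid = object_guid(b"<owner key bytes>", "/calendar/2001")
# Anyone who knows the owner's key and the name can recompute the GUID,
# so the name certifies itself without a central naming authority.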

Access Control Reader restriction Encrypt all data Revocation Delete all replicas Encrypt all replicas with a new key Clients holding old keys can still read cached copies of the old data

Access Control Writer restriction Writes are signed Reads are restricted at clients Writes are restricted at servers
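
A minimal sketch of the "writes are signed" restriction, assuming Ed25519 keys and the Python cryptography package purely for illustration: the client signs the update bytes, and a server verifies the signature against the writer's public key before accepting the write.

# Sketch of writer restriction: servers verify a signature before applying a write.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

writer_key = Ed25519PrivateKey.generate()
writer_pub = writer_key.public_key()           # known to servers, e.g. via an ACL

update = b"replace block 7 with <ciphertext>"
signature = writer_key.sign(update)

def server_accepts(update: bytes, signature: bytes) -> bool:
    try:
        writer_pub.verify(signature, update)   # raises if forged or altered
        return True
    except InvalidSignature:
        return False

assert server_accepts(update, signature)
assert not server_accepts(b"tampered update", signature)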

Data Location and Routing Objects can reside on any of the OceanStore servers Use query routing to locate objects

Distributed Routing in OceanStore Every object is identified by one or more GUIDs Different replicas of the same object have the same GUID OceanStore messages are labeled with A destination GUID (built on top of IP) A random number A small predicate

Bloom Filters Based on the idea of hill-climbing If a query cannot be satisfied by a server, local information is used to route the query to a likely neighbor Via a modified version of a Bloom filter

Bloom Filter A Bloom filter Represents a set S = {S1, …, Sn} Is depicted by an m-bit array, filter[m] Uses r independent hash functions h1, …, hr For i = 1…n, for j = 1…r: filter[hj(Si)] = 1

Insertion Example m = 6, r = 3 To insert word x: h1(x) = 0, h2(x) = 3, h3(x) = 5 filter[] = {1, 0, 0, 1, 0, 1}

Insertion Example m = 6, r = 3 To insert word y: h1(y) = 1, h2(y) = 3, h3(y) = 5 filter[] = {1, 1, 0, 1, 0, 1}

Testing Example filter[] = {1, 1, 0, 1, 0, 1} Does x belong to the set? filter[h1(x)] = filter[0] = 1, filter[h2(x)] = filter[3] = 1, filter[h3(x)] = filter[5] = 1 Does z belong to the set? filter[h1(z)] = filter[2] = 0 → no; filter[h2(z)] = filter[3] = 1, filter[h3(z)] = filter[5] = 1

False Positives If any filter[hj(x)] = 0, x is not in S If all filter[hj(x)] = 1, x is probably in S The false positive rate depends on the number of hash functions, the array size, and the number of unique elements in S
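
A small sketch pulling the Bloom filter slides together: inserting a word sets r bits of the m-bit array, a membership test checks the same r bits, and a test can return a false positive but never a false negative. The way the r hash functions are derived below is an arbitrary choice for the example.

# Sketch of a Bloom filter with an m-bit array and r hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m: int, r: int):
        self.m, self.r = m, r
        self.bits = [0] * m

    def _hashes(self, word: str):
        for j in range(self.r):                # derive r hash values per word
            digest = hashlib.sha1(f"{j}:{word}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, word: str):
        for h in self._hashes(word):
            self.bits[h] = 1

    def might_contain(self, word: str) -> bool:
        return all(self.bits[h] for h in self._hashes(word))

bf = BloomFilter(m=64, r=3)
bf.insert("x")
bf.insert("y")
assert bf.might_contain("x")                   # inserted words always test positive
# bf.might_contain("z") is usually False, but can be True: a false positive.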

Attenuated Bloom Filters An attenuated Bloom filter of depth D is an array of D normal Bloom filters The i-th Bloom filter is the union of the Bloom filters of all nodes at distance i One attenuated filter is kept per network edge
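
A sketch of how an attenuated Bloom filter guides the probabilistic lookup: for each outgoing edge, level i summarizes the objects stored roughly i hops away along that edge, and a query is forwarded on the edge that matches at the smallest depth. Plain sets stand in for the per-level bit arrays, and the topology and GUIDs are invented for the example.

# Sketch of hill-climbing over attenuated Bloom filters of depth D = 3.
D = 3

neighbors = {
    # edge name -> D per-level summaries (level 0 = the neighbor itself)
    "edge_A": [{"guid_1"}, {"guid_2"}, set()],
    "edge_B": [set(), {"guid_3"}, {"guid_2"}],
}

def next_hop(query_guid: str):
    """Forward the query along the edge whose filter matches at the smallest depth."""
    best_edge, best_depth = None, D
    for edge, levels in neighbors.items():
        for depth, level in enumerate(levels):
            if query_guid in level and depth < best_depth:
                best_edge, best_depth = edge, depth
                break
    return best_edge          # None: not nearby, fall back to the global algorithm

assert next_hop("guid_2") == "edge_A"     # one hop via A beats two hops via B
assert next_hop("guid_9") is None         # unknown locally: use the deterministic lookup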

Attenuated Bloom Filters (figure): example lookup of GUID 11010

The Global Algorithm: Wide-Scale Distributed Data Location Plaxton’s randomized hierarchical distributed data structure Resolve one digit of the node id at a time
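
The digit-at-a-time idea can be sketched as follows: every hop forwards a query to a neighbor whose ID shares at least one more leading digit with the destination, so a lookup takes at most one hop per digit (O(log N) hops for random IDs). The four-digit IDs, prefix orientation, and tiny routing tables below are simplifications for the example, not Tapestry's actual tables.

# Sketch of Plaxton-style routing: resolve one more digit of the destination per hop.
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(node_id: str, dest_id: str, table) -> str:
    """Pick a neighbor matching dest_id in at least one more leading digit."""
    need = shared_prefix_len(node_id, dest_id) + 1
    for neighbor in table[node_id]:
        if shared_prefix_len(neighbor, dest_id) >= need:
            return neighbor
    return node_id            # no better neighbor: this node is the root for dest_id

table = {
    "4227": ["4228", "4629", "1234"],
    "4629": ["4633", "4625", "4227"],
    "4633": ["4632", "4629"],
}
node, hops = "4227", []
while node != "4632":
    node = route(node, "4632", table)
    hops.append(node)
print(hops)                   # ['4629', '4633', '4632'] -- one more digit resolved per hop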

Achieving Locality Each new replica only needs to traverse O(log(n)) hops to reach the root, where n is the number of servers

Achieving Fault Tolerance Avoid failures at roots Each root GUID is hashed with a small number of different salt values Make it difficult to target a single GUID for DoS attacks If failures are detected, just jump to any node to reach the root OceanStore continually monitors and repairs broken pointers

Advantages of Distributed Information Redundant paths to roots Scalable with a combination of probabilistic and global algorithms Easy to locate and recover failed components Plaxton links form a natural substrate for admission control and multicast

Achieving Maintenance-Free Operation Recursive node insertion and removal Replicated roots Use beacons to detect faults Time-to-live fields to update routes Second-chance algorithm to avoid false diagnoses of failed components Avoid the cost of recovering lost nodes Automatic reconstruction of data for failed servers

Update Model Conflict resolution update model Challenge: Untrusted infrastructure Access only to ciphertext

Update Format and Semantics An update: a list of predicates associated with actions If any predicate evaluates to true, the actions associated with the earliest true predicate are atomically applied Everything is logged

Extending the Model to Work over Ciphertext Supported predicates Compare version (unencrypted metadata) Compare size (unencrypted metadata) Compare block Compares a hash of the encrypted block Search Returns only yes/no Cannot be initiated by the server Supported actions Replace/insert/delete/append block
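
A sketch of the update model from the last two slides: an update is a list of (predicate, actions) pairs, the actions of the earliest true predicate are applied atomically, and block comparisons work over ciphertext by comparing hashes, so an untrusted server never needs the plaintext. The object layout and helper names are invented for the example.

# Sketch of predicate/action updates applied by an untrusted server.
import hashlib

def compare_version(expected):
    return lambda obj: obj["version"] == expected

def compare_block(index, expected_hash):
    # The server only sees ciphertext; it compares a hash of the encrypted block.
    return lambda obj: hashlib.sha1(obj["blocks"][index]).hexdigest() == expected_hash

def replace_block(index, ciphertext):
    def action(obj):
        obj["blocks"][index] = ciphertext
        obj["version"] += 1
    return action

def apply_update(obj, update):
    """Apply the actions of the earliest predicate that evaluates to true."""
    for predicate, actions in update:
        if predicate(obj):
            for act in actions:        # a real system applies these atomically and logs them
                act(obj)
            return True
    return False                       # no predicate held: the update aborts

obj = {"version": 3, "blocks": [b"<ct0>", b"<ct1>"]}
ct1_hash = hashlib.sha1(b"<ct1>").hexdigest()
update = [
    (compare_block(1, ct1_hash), [replace_block(1, b"<new ct1>")]),   # expected case
    (compare_version(3), []),                                         # fallback: do nothing
]
assert apply_update(obj, update) and obj["version"] == 4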

Serializing Updates in an Untrusted Infrastructure Use a small primary tier of replicas to serialize updates Minimize communication Meanwhile, the replicas in a secondary tier optimistically propagate updates among themselves The final ordering from the primary tier is multicast to the secondary replicas

A Direct Path to Clients and Archival Storage Updates flow directly from a client to the primary tier, where they are serialized and then multicast to the secondary servers Updates are tightly coupled with archival Archival fragments are generated at serialization time and distributed with updates

Efficiency of the Consistency Protocol For updates larger than 4 KB, network overhead is below 100% Approximate latency per update is under 1 second

Deep Archival Storage Erasure encoded block fragments Use small and widely distributed fragments to increase reliability Administrative domains are ranked by their reliability and trustworthiness Avoid locations with correlated failures

The OceanStore API Session: a sequence of reads and writes to potentially different objects Session guarantees: define the level of consistency Updates Callback: for user-defined events (commit) Façade: an interface to conventional APIs: UNIX file systems, transactional databases, WWW gateways
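
Since the slides only name the API concepts, the following is a purely hypothetical sketch of what a session with consistency guarantees and a commit callback could look like; the class and method names are invented for illustration and are not OceanStore's real interface.

# Hypothetical sketch of session-based access with guarantees and callbacks.
from typing import Callable, List, Tuple

class Session:
    """A sequence of reads and writes with a chosen level of consistency."""
    def __init__(self, guarantees: str, on_commit: Callable[[str], None]):
        self.guarantees = guarantees              # e.g. "read-your-writes"
        self.on_commit = on_commit                # fired when an update commits
        self.pending: List[Tuple[str, bytes]] = []

    def read(self, guid: str) -> bytes:
        # A real facade (UNIX FS, database, WWW gateway) would fetch a replica
        # consistent with `guarantees`; stubbed out here.
        return b""

    def write(self, guid: str, data: bytes) -> None:
        self.pending.append((guid, data))         # buffered until commit

    def commit(self) -> None:
        for guid, _ in self.pending:
            self.on_commit(guid)                  # user-defined commit event
        self.pending.clear()

session = Session("read-your-writes", on_commit=lambda g: print("committed", g))
session.write("object-guid", b"new bytes")
session.commit()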

Introspection Observation modules monitor the activity of a running system and track system behavior Optimization modules adjust the computation (figure: the cycle computation → observation → optimization)

Uses of Introspection Cluster recognition Identify related files Replica management Adjust replication factors Migrate floating replicas
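
One way to picture the introspection loop: an observation module counts per-object activity, and an optimization module adjusts replication factors from those counts. The thresholds and data structures below are invented for the illustration.

# Sketch of an observation/optimization pair adjusting replication factors.
from collections import Counter

read_counts = Counter()                 # observation: per-object read activity
replication_factor = {}                 # optimization output per object

def observe_read(guid: str) -> None:
    read_counts[guid] += 1

def optimize(min_replicas: int = 2, max_replicas: int = 10) -> None:
    for guid, reads in read_counts.items():
        wanted = min(max_replicas, max(min_replicas, reads // 100))
        replication_factor[guid] = wanted   # a real system would create or retire floating replicas

for _ in range(450):
    observe_read("hot-object")
observe_read("cold-object")
optimize()
assert replication_factor == {"hot-object": 4, "cold-object": 2}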

Related Work Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), July 1970 The Bayou architecture: Support for data sharing among mobile users. In Proc. of IEEE Workshop on Mobile Computing Systems and Applications, Dec. 1994

Related Work A tutorial on Reed-Solomon coding for fault tolerance in RAID-like systems. Software: Practice and Experience, 27(9), September 1997 Accessing nearby copies of replicated objects in a distributed environment. In Proc. of ACM SPAA, June 1997 Search on encrypted data. IEEE SRSP, May 2000