OceanStore An Architecture for Global-scale Persistent Storage By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao Presented by Yongbo Wang, Hailing Yu

Ubiquitous Computing

OceanStore Overview A global-scale utility infrastructure: an Internet-based, distributed storage system for information appliances such as computers, PDAs, cellular phones, … It is designed to support 10^10 users, each having 10^4 data files (over 10^14 files in total)

OceanStore Overview (cont) Automatically recovers from server and network failures Utilizes redundancy and client-side cryptographic techniques to protect data Allows replicas of a data object to exist anywhere, at any time Incorporates new resources Adjusts to usage patterns

OceanStore Two unique design goals: Ability to be constructed from an untrusted infrastructure Servers may crash Information can be stolen Support of nomadic data Data can be cached anywhere, anytime (promiscuous caching) Data is separated from its physical location

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Naming Objects are identified by a globally unique identifier (GUID) Different kinds of objects in OceanStore use different mechanisms to generate their GUIDs

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Access Control Reader Restriction Encrypt the data that is not public Distribute the encryption key to users having read permission Writer Restriction The owner of an object chooses an access control list (ACL) for the object All writes are verified by well-behaved servers and clients based on the ACL.
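A minimal sketch of the reader-restriction idea above, assuming one symmetric key per object; the class name and key-distribution details are illustrative, not part of the OceanStore design as presented in these slides:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Illustrative sketch: non-public data is stored encrypted; only users who
// have been given the object's key can read it.
public class ReaderRestriction {
    public static void main(String[] args) throws Exception {
        // The object owner generates a symmetric key for the object ...
        SecretKey objectKey = KeyGenerator.getInstance("AES").generateKey();

        // ... encrypts the data before handing it to (untrusted) servers ...
        Cipher enc = Cipher.getInstance("AES");
        enc.init(Cipher.ENCRYPT_MODE, objectKey);
        byte[] ciphertext = enc.doFinal("object contents".getBytes("UTF-8"));

        // ... and distributes objectKey only to users with read permission.
        // A reader who holds the key can recover the plaintext:
        Cipher dec = Cipher.getInstance("AES");
        dec.init(Cipher.DECRYPT_MODE, objectKey);
        System.out.println(new String(dec.doFinal(ciphertext), "UTF-8"));
    }
}
```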

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Data Location and Routing Provides necessary service to route messages to their destinations and to locate objects in the system Works on top of IP

Data Location and Routing Each object in the system is identified by a globally unique identifier, or GUID (a pseudo-random, fixed-length bit string) An object's GUID is a secure hash over the object's contents OceanStore uses a 160-bit SHA-1 hash, for which the probability that two out of 10^14 objects hash to the same value is approximately 1 in 10^20
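A minimal sketch of GUID generation as described above, hashing an object's contents with SHA-1 (the helper name is an assumption):

```java
import java.security.MessageDigest;

public class Guid {
    // Hash an object's contents into a 160-bit GUID, shown here as hex.
    static String guidOf(byte[] contents) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(contents);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();   // 40 hex digits = 160 bits
    }

    public static void main(String[] args) throws Exception {
        System.out.println(guidOf("hello, OceanStore".getBytes("UTF-8")));
    }
}
```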

Data Location and Routing In OceanStore system, entities that are accessed frequently are likely to reside close to where they are being used Two-tiered approach: First use a fast probabilistic algorithm If necessary, use a slower but reliable hierarchical algorithm

Probabilistic algorithm Each server has a set of neighbors, chosen from servers closest to it in network latency A server associates with each neighbor a probability of finding each object in the system through that neighbor This association is maintained in constant space using an attenuated Bloom filter

Bloom Filters An efficient, lossy way of describing sets A Bloom filter is a bit-vector of length w with a family of hash functions Each hash function maps the elements of the represented set to an integer in [0,w) To form a representation of a set, each set element is hashed and the bits in the vector corresponding to the hash functions' results are set

Bloom Filters To check if an element is in the set Element is hashed Corresponding bits in the filter are checked - If any of the bits are not set, it is not in the set - If all bits are set, it may be in the set The element may not be in the set even if all of the hashed bits are set (false positive) The false-positive rate of a Bloom filter is a function of its width, the number of hash functions, and the cardinality of the represented set
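A minimal Bloom filter sketch matching the description above; the width, number of hash functions, and the way the i-th hash is derived are illustrative choices, not OceanStore's actual parameters:

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash functions map an element to k bit positions.
public class BloomFilter {
    private final BitSet bits;
    private final int width;      // w: number of bits in the filter
    private final int numHashes;  // k: number of hash functions

    BloomFilter(int width, int numHashes) {
        this.bits = new BitSet(width);
        this.width = width;
        this.numHashes = numHashes;
    }

    // Derive the i-th hash from the element's hashCode (simplistic, illustrative scheme).
    private int hash(String element, int i) {
        return Math.floorMod(element.hashCode() * (i + 1) + i, width);
    }

    void add(String element) {
        for (int i = 0; i < numHashes; i++) bits.set(hash(element, i));
    }

    // True means "may be present" (false positives possible);
    // false means "definitely not present" (no false negatives).
    boolean mightContain(String element) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(hash(element, i))) return false;
        return true;
    }
}
```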

A Bloom Filter: To check an object’s name against a Bloom filter summary, the name is hashed with n different hash functions (here, n=3) and bits corresponding to the result are checked

Attenuated Bloom Filters An attenuated Bloom filter of depth d is an array of d normal Bloom filters For each neighbor link, an attenuated Bloom filter is kept The k-th Bloom filter in the array is the merger of the Bloom filters of all the nodes k hops away through any path starting with that neighbor link

Attenuated Bloom filter for the outgoing link A → B: in F_AB, the document “Uncle John’s Band” would map to potential value 1/4 + 1/8 = 3/8.

The Query Algorithm The query node examines the 1st level of each of its neighbors' filters If matches are found, the query is forwarded to the closest matching neighbor If no filter matches, the querying node examines the next level of each filter at each step and forwards the query once a match is found
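A sketch of this query step, assuming each neighbor link carries an attenuated filter built from the BloomFilter sketch above; tie-breaking among matches by network latency is omitted, and the class and field names are illustrative:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the probabilistic query step: an attenuated Bloom
// filter is one BloomFilter per depth, kept per outgoing neighbor link.
public class ProbabilisticQuery {
    static class AttenuatedFilter {
        List<BloomFilter> levels;   // levels.get(k) summarizes nodes k+1 hops away
    }

    // Returns the neighbor to forward the query to, or null to fall back to
    // the deterministic (Tapestry) algorithm.
    static String nextHop(String objectName,
                          Map<String, AttenuatedFilter> neighbors,
                          int depth) {
        for (int level = 0; level < depth; level++) {          // shallowest match first
            for (Map.Entry<String, AttenuatedFilter> e : neighbors.entrySet()) {
                if (e.getValue().levels.get(level).mightContain(objectName)) {
                    return e.getKey();                          // forward along this link
                }
            }
        }
        return null;   // no filter matched: hand the query to Tapestry
    }
}
```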

The probabilistic query process: n_1 is looking for object X, which hashes to bits 0, 1, and 3.

Probabilistic location and routing A filter of depth d stores information about servers up to d hops from the server If a query reaches a server d hops away from its source due to a false positive, it is not forwarded further In this case, the probabilistic algorithm gives up and hands the query to the deterministic algorithm

Deterministic location and routing Tapestry: OceanStore's self-organizing routing and object location subsystem An overlay network on top of IP with a distributed, fault-tolerant architecture A query is routed from node to node until the location of a replica is discovered

Tapestry A hierarchical distributed data structure Every server is assigned a random and unique node-ID The node-IDs are then used to construct a mesh of neighbor links

Tapestry Every node is connected to other nodes via neighbor links of various levels Level-1 edges connect to a set of nodes closest in network latency with different values in the lowest digit of their node-IDs Level-2 edges connect to the closest nodes that match in the lowest digit and differ in the second digit, etc.

Tapestry Each node has a neighbor map with multiple levels For example, the 9th entry of the 4th level for node 325AE is the node closest to 325AE whose ID ends in 95AE Messages are routed to the destination ID digit by digit: ***8 => **98 => *598 => 4598
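A sketch of the digit-by-digit (suffix-matching) hop selection illustrated above; the routing table is reduced to a flat neighbor list, and latency-based neighbor choice and backup routes are omitted (all names and IDs are illustrative):

```java
import java.util.List;

public class SuffixRouting {
    // Number of trailing digits of `dest` that `id` already matches.
    static int suffixMatch(String id, String dest) {
        int n = 0;
        int max = Math.min(id.length(), dest.length());
        while (n < max
                && id.charAt(id.length() - 1 - n) == dest.charAt(dest.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    // One routing hop: forward to any neighbor that matches one more trailing digit.
    static String nextHop(String currentId, String destId, List<String> neighbors) {
        int have = suffixMatch(currentId, destId);
        for (String n : neighbors) {
            if (suffixMatch(n, destId) > have) return n;   // one digit closer to destId
        }
        return null;   // no better neighbor; a real node would fall back to a backup route
    }

    public static void main(String[] args) {
        // ***8 => **98 => *598 => 4598: each hop fixes one more trailing digit of 4598
        System.out.println(nextHop("0325", "4598", List.of("B4F8", "9098", "7598")));  // B4F8
    }
}
```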

Neighbor Map for Tapestry node 0642

Tapestry routing example: A potential path for a message originating at node 0325 destined for node 4598

Tapestry Each object is associated with a location root through a deterministic mapping function To advertise an object o, the server s storing the object sends a publish message toward the object’s root, leaving location pointers at each hop

Tapestry routing example: To publish an object, the server storing the object sends a publish message toward the object’s root (e.g. node 4598), leaving location pointers at each node

Locating an object To locate an object, a client sends a message toward the object's root. When the message encounters a location pointer, it routes directly to the object It has been proved that Tapestry can route the request to an asymptotically optimal node (in terms of shortest-path network distance) containing a replica

Tapestry routing example: To locate an object, node 0325 sends a message toward the object’s root (e.g. node 4598)

Data Location and Routing Fault tolerance: Tapestry uses redundant neighbor pointers when it detects a primary route failure Uses periodic UDP probes to check link conditions Tapestry deterministically chooses multiple root nodes for each object

Data Location and Routing Automatic repair: Node insertions: A new node needs the address of at least one existing node It then starts advertising its services and the roles it can assume to the system through the existing node Exiting nodes: If possible, the exiting node runs a shutdown script to inform the system In any case, neighbors will detect its absence and update routing tables accordingly

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Updates Updates are made by clients and all updates are logged OceanStore allows concurrent updates Serializing updates: Since the infrastructure is untrusted, using a master replica will not work Instead, a group of servers called the inner ring is responsible for choosing the final commit order

Update commitment The inner ring is a group of servers working on behalf of an object. It consists of a small number of highly connected servers. Each object has an inner ring, which can be located through Tapestry

Inner ring An object's inner ring: Generates new versions of an object from client updates Generates encoded, archival fragments and distributes them Provides the mapping from the active GUID to the GUID of the most recent version of the object Verifies a data object's legitimate writers Maintains an update history, providing an undo mechanism

Update commitment Each inner ring makes its decisions through a Byzantine agreement protocol Byzantine agreement lets a group of 3f+1 servers reach an agreement whenever no more than f of them are faulty
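The standard bound behind this statement, written out in the notation above (f faulty servers out of n total):

```latex
n \ge 3f + 1
\quad\Longleftrightarrow\quad
f \le \left\lfloor \frac{n - 1}{3} \right\rfloor
```

For example, an inner ring of 4 servers tolerates at most 1 faulty server, and a ring of 7 tolerates at most 2.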

Update commitment Other nodes containing the data of that object are called secondary nodes They do not participate in serialization protocol They are organized into one or more multicast trees (dissemination trees)

Path of an update: a) After generating an update, a client sends it directly to the object's inner ring b) While the inner ring performs a Byzantine agreement to commit the update, secondary nodes propagate the update among themselves c) The result of the update is multicast down the dissemination tree to all secondary nodes

Cost of an update in bytes sent across the network, normalized to the minimum cost needed to send the update to each of the replicas

Update commitment Fault tolerance: Guarantees fault tolerance as long as fewer than one third of the servers in the inner ring are malicious Secondary nodes do not participate in the Byzantine protocol, but receive consistency information

Update commitment Automatic repair: Servers of the inner ring can be changed without affecting the rest of the system Servers participating in the inner ring are altered continuously to maintain the Byzantine assumption

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Deep Archival Storage Each object is divided into m fragments, which are then encoded into n fragments, where n > m, using Reed-Solomon erasure coding Any m of the n encoded fragments are sufficient to reconstruct the original data Rate of encoding: r = m/n Storage overhead = 1/r = n/m
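A worked example with illustrative numbers (the slides do not state OceanStore's actual parameters): encoding m = 16 source fragments into n = 64 total fragments gives

```latex
r = \frac{m}{n} = \frac{16}{64} = \frac{1}{4},
\qquad
\text{storage overhead} = \frac{1}{r} = \frac{n}{m} = 4,
```

so the archived object occupies four times its original size, and any 16 of the 64 fragments suffice to reconstruct it.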

Underlying Technologies Naming Access control Data Location and Routing Data Update Deep Archival Storage Introspection

Introspection It is impossible to manually administer millions of servers and objects OceanStore contains introspection tools Event monitoring Event analysis Self-adaptation

Introspection Introspective modules on servers observe network traffic and measure local traffic. They automatically create, replace, and remove replicas in response to object’s usage patterns

Introspection If a replica becomes unavailable: Clients will receive service from a more distant replica This produces extra load on distant replicas Introspective mechanism detects this and new replicas are created Above actions provide fault tolerance and automatic repair

Event handlers summarize local events. These summaries are stored in a database. The information in the database is periodically analyzed and the necessary actions are taken. A summary is sent to other nodes.

Conclusion OceanStore provides a global-scale, distributed storage platform through adaptation, fault tolerance and repair It is self-maintaining A prototype implemented in Java is under construction at UC Berkeley. Although it is not operational yet, many components are already functioning in isolation

The end… Questions?