OceanStore: Global-Scale Persistent Storage
John Kubiatowicz, University of California at Berkeley

OceanStore Context: Ubiquitous Computing
Computing everywhere:
– Desktop, laptop, palmtop
– Cars, cellphones
– Shoes? Clothing? Walls?
Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser

Questions About Information
Where is persistent information stored?
– Want: geographic independence for availability, durability, and freedom to adapt to circumstances
How is it protected?
– Want: encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
Can we make it indestructible?
– Want: redundancy with continuous repair and redistribution for long-term durability
Is it hard to manage?
– Want: automatic optimization, diagnosis, and repair
Who owns the aggregate resources?
– Want: utility infrastructure!

First Observation: Want Utility Infrastructure
Mark Weiser from Xerox: transparent computing is the ultimate goal
– Computers should disappear into the background
In the storage context:
– Don't want to worry about backup
– Don't want to worry about obsolescence
– Need lots of resources to make data secure and highly available, BUT don't want to own them
– Outsourcing of storage already becoming popular: pay a monthly fee and your "data is out there"
– Simple payment interface ⇒ one bill from one company

Second Observation: Want Automatic Maintenance
Can't possibly manage billions of servers by hand!
The system should automatically:
– Adapt to failure
– Repair itself
– Incorporate new elements
Can we guarantee data is available for 1000 years?
– New servers added from time to time
– Old servers removed from time to time
– Everything just works
Many components with geographic separation:
– System not disabled by natural disasters
– Can adapt to changes in demand and regional outages
– Gain in stability through statistics

Utility-Based Infrastructure
[Figure: federated utility "clouds" labeled Pac Bell, Sprint, IBM, AT&T, and Canadian OceanStore]
Transparent data service provided by a federation of companies:
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other

OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
How many files in the OceanStore?
– Assume 10^10 people in the world
– Say 10,000 files/person (very conservative?)
– So 10^14 files in OceanStore!
– If 1 GB files (OK, a stretch), get 1 mole of bytes!
Truly impressive number of elements…
…but small relative to physical constants
Aside: new results: 1.5 exabytes/year of new data (1.5 × 10^18 bytes)
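As a sanity check on the slide's arithmetic, here is the "mole of bytes" estimate worked out in a few lines of Python; the 10^23-byte total is about a sixth of Avogadro's number, so the slide's "mole" is a deliberately loose order-of-magnitude claim:

```python
# Back-of-envelope check of the slide's storage estimate.
users = 10**10           # assumed people in the world
files_per_user = 10**4   # 10,000 files/person
bytes_per_file = 10**9   # 1 GB/file ("OK, a stretch")

total_bytes = users * files_per_user * bytes_per_file   # 1e23
avogadro = 6.022e23                                     # one mole

print(f"total bytes:        {total_bytes:.1e}")
print(f"fraction of a mole: {total_bytes / avogadro:.2f}")  # ~0.17
```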

Outline
– Motivation
– Assumptions of the OceanStore
– Specific technologies and approaches:
  – Naming
  – Routing and data location
  – Conflict resolution on encrypted data
  – Replication and deep archival storage
  – Oceanic data market
  – Introspection for optimization and repair
– Conclusion

OceanStore Assumptions
Untrusted infrastructure:
– The OceanStore is built from untrusted components
– Only ciphertext within the infrastructure
Responsible party:
– Some organization (e.g., a service provider) guarantees that your data is consistent and durable
– Not trusted with the content of data, merely its integrity
Mostly well-connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible
Promiscuous caching:
– Data may be cached anywhere, anytime
Optimistic concurrency via conflict resolution:
– Avoid locking in the wide area
– Applications use an object-based interface for updates

Use of Moore's-Law Gains
Question: can we use Moore's-law gains for something other than just raw performance?
– Growth in computational performance
– Growth in network bandwidth
– Growth in storage capacity
Examples:
– Stability through statistics: use redundancy of servers, network packets, etc., to gain more predictable behavior (a systems version of thermodynamics!)
– Extreme durability (1000-year time scale?): use erasure coding and continuous repair
– Security and authentication: signatures and secure hashes in many places
– Continuous dynamic optimization

Basic Structure: Irregular Mesh of "Pools"

Secure Naming
Unique, location-independent identifiers:
– Every version of every unique entity has a permanent, Globally Unique ID (GUID)
– All OceanStore operations operate on GUIDs
Naming hierarchy:
– Users map from names to GUIDs via a hierarchy of OceanStore objects (à la SDSI)
– Requires a set of "root keys" to be acquired by the user
– Each link is either a GUID (read-only) or a GUID/public-key combination
[Figure: naming hierarchy Foo → Bar → Baz → Myfile, anchored by an out-of-band "root link"]

Unique Identifiers
Secure hashing is key!
Use of 160-bit SHA-1 hashes over information provides uniqueness, unforgeability, and verifiability:
– Read-only data: GUID is the hash over the actual information
  – Uniqueness and unforgeability: the data is what it is!
  – Verification: check the hash over the data
– Changeable data: GUID is a combined hash over a human-readable name + public key
  – Uniqueness: GUID space selected by public key
  – Unforgeability: public key is indelibly bound to the GUID
  – Verification: check signatures with the public key
Is 160 bits enough?
– Birthday paradox requires over 2^80 unique objects before collisions become worrisome
– Good enough for now
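A minimal Python sketch of the two GUID flavors described above; the function names are illustrative, not OceanStore's actual API:

```python
import hashlib

def guid_readonly(data: bytes) -> str:
    """Read-only data: the GUID is the SHA-1 hash of the content
    itself, so the data is self-verifying ("the data is what it is")."""
    return hashlib.sha1(data).hexdigest()

def guid_mutable(name: str, public_key: bytes) -> str:
    """Changeable data: the GUID is a combined hash over a
    human-readable name plus the owner's public key, indelibly
    binding the key to the GUID."""
    return hashlib.sha1(name.encode() + public_key).hexdigest()

def verify_readonly(data: bytes, guid: str) -> bool:
    """Verification for read-only data: recompute and compare."""
    return guid_readonly(data) == guid
```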

GUIDs ⇒ Secure Pointers
[Figure: global object resolution maps a name + key to an Active GUID, which points to a floating replica (active object) holding active data, commit logs, a checkpoint GUID, an archival GUID, a signature, RP keys, ACLs, and metadata; the Archival GUID resolves to erasure-coded archival copies or snapshots of the inactive object]

Routing and Data Location
Requirements:
– Find data quickly, wherever it might reside
  – Locate nearby data without global communication
  – Permit rapid data migration
– Insensitive to faults and denial-of-service attacks
  – Provide multiple routes to each piece of data
  – Route around bad servers and ignore bad data
– Repairable infrastructure
  – Easy to reconstruct routing and location information
Technique: combined routing and data location
– Packets are addressed to GUIDs, not locations
– The infrastructure gets the packets to their destinations and verifies that servers are behaving

Two Levels of Routing
Fast, probabilistic search for a "routing cache":
– Built from attenuated Bloom filters
– Approximation to gradient search
– Not going to say more about this today
Redundant Plaxton mesh used for the underlying routing infrastructure:
– Randomized data structure with locality properties
– Redundant, insensitive to faults, and repairable
– Amenable to continuous adaptation to adjust for:
  – Changing network behavior
  – Faulty servers
  – Denial-of-service attacks
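For intuition, here is a toy attenuated-Bloom-filter sketch, assuming one filter per hop distance along each outgoing link; the filter size, salt scheme, and distance-discounted scoring rule are all invented for illustration and are not Tapestry's actual data structure:

```python
import hashlib

def bloom_bits(guid: str, size: int = 1024, k: int = 3) -> set[int]:
    """The k bit positions a GUID sets in a Bloom filter of `size` bits."""
    return {int(hashlib.sha1(f"{guid}:{i}".encode()).hexdigest(), 16) % size
            for i in range(k)}

class AttenuatedBloomFilter:
    """Per outgoing link: level d summarizes the GUIDs reachable
    d+1 hops away in that direction (bit arrays modeled as sets)."""
    def __init__(self, depth: int = 4, size: int = 1024):
        self.size = size
        self.levels = [set() for _ in range(depth)]

    def add(self, guid: str, distance: int) -> None:
        self.levels[distance] |= bloom_bits(guid, self.size)

    def score(self, guid: str) -> float:
        """Evidence that `guid` lies along this link; nearer levels
        count more, approximating a gradient search. False positives
        are possible, hence 'fast, probabilistic'."""
        bits = bloom_bits(guid, self.size)
        return sum(2.0 ** -d
                   for d, level in enumerate(self.levels)
                   if bits <= level)
```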

Basic Plaxton Mesh: Incremental Suffix-Based Routing
[Figure: nodes labeled with 16-bit hex IDs (0x43FE, 0x13FE, 0xABFE, 0x1290, 0x239E, …); each hop forwards to a neighbor matching the destination ID in one more suffix digit]
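The routing step itself is simple to sketch. A minimal version in Python, assuming each node keeps a neighbor table indexed by (suffix-match level, next digit); this flattens the real per-level routing tables into one dictionary for brevity:

```python
def suffix_match_len(a: str, b: str) -> int:
    """Number of trailing hex digits two IDs share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(current: str, dest: str,
             table: dict[tuple[int, str], str]) -> str:
    """One Plaxton routing step: if `current` matches `dest` in n
    suffix digits, forward to a neighbor matching in n+1.
    `table[(n, d)]` names a neighbor sharing dest's last n digits
    whose next digit (n+1-th from the end) is d."""
    n = suffix_match_len(current, dest)
    return table[(n, dest[-1 - n])]

# e.g. at node "0290" routing toward "43FE": n = 0, so look up a
# neighbor whose ID ends in "E"; the next hop resolves "FE", and so on.
```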

Use of the Plaxton Mesh: Randomization and Locality

Use of the Plaxton Mesh: The Tapestry Infrastructure
As in the original Plaxton scheme:
– Scheme to directly map GUIDs to root node IDs
– Replicas publish toward a document root
– Search walks toward the root until a pointer is located ⇒ locality!
OceanStore enhancements for reliability:
– Documents have multiple roots (salted hash of the GUID)
– Each node has multiple neighbor links
– Searches proceed along multiple paths (a tradeoff between reliability and bandwidth?)
– Routing-level validation of query results
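A sketch of the multiple-roots idea using salted SHA-1 hashes; the salt format, four-root count, and 4-digit IDs are illustrative choices, not Tapestry's actual parameters:

```python
import hashlib

def root_ids(guid: str, num_roots: int = 4, digits: int = 4) -> list[str]:
    """Map a document GUID to several root node IDs by hashing the
    GUID with different salts. Publishes and queries can proceed
    toward any (or all) of the roots, so no single root is a point
    of failure or a denial-of-service target."""
    return [hashlib.sha1(f"{guid}|salt{i}".encode()).hexdigest()[:digits]
            for i in range(num_roots)]

print(root_ids("43fe"))  # four independent hex root IDs
```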

How Important Are the Pointers?
[Figure: simulation results on a transit-stub IP network of 5,000 nodes with 100 Tapestry overlay nodes]

Automatic Maintenance (Tapestry)
All Tapestry state is soft state:
– State maintained during transformations of the network
– Periodic restoration of state
– Self-tuning of link structure
Dynamic insertion:
– New nodes contact a small number of existing nodes
– Integrate themselves automatically
– Later, introspective optimization will move data to the new servers
Dynamic deletion:
– Node detected as unresponsive
– Pointer state routed around the faulty node (signed deletion requests authorized by servers holding the data)

OceanStore Consistency via Conflict Resolution
Consistency is a form of optimistic concurrency:
– An update packet contains a series of predicate-action pairs which operate on encrypted data
– Each predicate is tried in turn:
  – If none match, the update is aborted
  – Otherwise, the action of the first true predicate is applied
Role of the responsible party:
– All updates are submitted to the responsible party, which chooses a final total order
– Byzantine agreement with threshold signatures
This is powerful enough to synthesize:
– ACID database semantics
– Release consistency (build and use MCS-style locks)
– Extremely loose (weak) consistency
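The predicate-action loop is easy to sketch. A minimal plaintext version in Python (the real system evaluates predicates over encrypted blocks, as the next slide discusses):

```python
def apply_update(state: dict, pairs: list) -> bool:
    """Try each (predicate, action) pair in turn: apply the action of
    the first true predicate; abort the update if none match."""
    for predicate, action in pairs:
        if predicate(state):
            action(state)
            return True   # committed
    return False          # aborted; the client may rebase and retry

# Example: a compare-and-swap-style update.
obj = {"version": 3, "data": "old"}
committed = apply_update(obj, [
    (lambda s: s["version"] == 3,                    # expected version
     lambda s: s.update(version=4, data="new")),
])
print(committed, obj)   # True {'version': 4, 'data': 'new'}
```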

Oblivious Updates on Encrypted Data?
Tentative scheme:
– Divide data into small blocks
– Updates on a per-block basis
– Predicates derived from techniques for searching on encrypted data
Still exploring other options
[Figure: update packet layout: timestamp, client ID, {Pred1, Update1}, {Pred2, Update2}, {Pred3, Update3}, client signature; the unique update ID is a hash over the packet]
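A sketch of that packet layout as a Python dataclass; the field types and the serialization fed to the hash are assumptions for illustration only:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class UpdatePacket:
    timestamp: float   # when the client issued the update
    client_id: str     # who issued it
    pairs: list        # [(predicate, update), ...] over encrypted blocks
    signature: bytes   # client signature over the fields above

    def update_id(self) -> str:
        """Unique update ID: a hash over the packet contents."""
        body = repr((self.timestamp, self.client_id,
                     self.pairs, self.signature))
        return hashlib.sha1(body.encode()).hexdigest()

pkt = UpdatePacket(1.0, "client-42", [("pred1", "upd1")], b"sig")
print(pkt.update_id())
```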

The Path of an OceanStore Update

Automatic Maintenance (Consistency Tier)
Byzantine commitment for the inner ring:
– Can tolerate up to 1/3 faulty servers in the inner ring
  – Bad servers can be arbitrarily bad
  – Cost: ~n^2 communication
– Continuous refresh of the set of inner-ring servers
  – Proactive threshold signatures
  – Use of Tapestry ⇒ membership of the inner ring unknown to clients
Secondary tier self-organized into an overlay dissemination tree:
– Use of Tapestry routing to suggest placement of replicas in the infrastructure
– Automatic choice between update vs. invalidate
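The 1/3 bound and the quadratic cost are standard Byzantine-agreement facts; a two-line sketch of the arithmetic:

```python
def max_faulty(n: int) -> int:
    """Byzantine agreement needs n >= 3f + 1 servers to tolerate f
    arbitrarily bad ones, i.e. strictly fewer than 1/3 faulty."""
    return (n - 1) // 3

def messages_per_round(n: int) -> int:
    """All-to-all exchange: the ~n^2 communication cost the slide
    cites, which is why the inner ring is kept small."""
    return n * (n - 1)

for n in (4, 7, 10):
    print(n, max_faulty(n), messages_per_round(n))
```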

Data Coding Model
Two distinct forms of data: active and archival
Active data in floating replicas:
– Per-object virtual server
– Logging for updates/conflict resolution
– Interaction with other replicas to keep data consistent
– May appear and disappear like bubbles
Archival data in erasure-coded fragments:
– m-of-n coding, like a hologram:
  – Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
  – Coding overhead is proportional to n/m (e.g., 4)
  – Law-of-large-numbers advantage to fragmentation
– Fragments are self-verifying
– OceanStore equivalent of stable store
Most data in the OceanStore is archival!
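A small sketch of the self-verifying-fragments idea; the erasure code itself (e.g., the interleaved Reed-Solomon codes mentioned on the implementation slide) is elided, and the function names are illustrative:

```python
import hashlib

def package_fragments(coded: list) -> list:
    """Ship each erasure-coded fragment with its own hash, so any
    server can check a fragment without trusting its source."""
    return [(frag, hashlib.sha1(frag).hexdigest()) for frag in coded]

def verify_fragment(frag: bytes, digest: str) -> bool:
    return hashlib.sha1(frag).hexdigest() == digest

# m-of-n bookkeeping for the slide's example: with m=16 and n=64,
# storage overhead is n/m = 4x, and any 16 verified fragments
# suffice to rebuild the object.
m, n = 16, 64
print("overhead:", n / m)
```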

Archival Dissemination of Fragments

Floating Replica and Deep Archival Coding
[Figure: three floating replicas, each holding a full copy with conflict resolution logs and a version history (Ver1: 0x34243, Ver2: 0x49873, Ver3: …)]

Statistical Advantage of Fragments
Latency and standard deviation reduced:
– Memory-less latency model
– Rate-1/2 code with 32 total fragments
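The effect is easy to reproduce with a small Monte Carlo sketch under the slide's assumptions (memory-less, i.e. exponential, fetch latencies; a rate-1/2 code, so 16 of 32 fragments reconstruct the object); the trial count and unit mean latency are arbitrary:

```python
import random

MEAN = 1.0  # arbitrary unit latency

def replica_fetch() -> float:
    """Whole-object fetch from one replica: a single exponential draw."""
    return random.expovariate(1.0 / MEAN)

def fragment_fetch(m: int = 16, n: int = 32) -> float:
    """Request all n fragments in parallel; done when the m-th
    fastest arrives (the m-th order statistic)."""
    return sorted(random.expovariate(1.0 / MEAN) for _ in range(n))[m - 1]

trials = 10_000
for name, draw in (("replica", replica_fetch), ("fragments", fragment_fetch)):
    xs = [draw() for _ in range(trials)]
    mean = sum(xs) / trials
    std = (sum((x - mean) ** 2 for x in xs) / trials) ** 0.5
    print(f"{name:9s} mean={mean:.2f} std={std:.2f}")
# fragments win on both mean (~0.68 vs 1.0) and std (~0.17 vs ~1.0)
```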

Automatic Maintenance (Archival Tier)
Tapestry knows every location of a fragment:
– Reverse-pointer traversal permits a multicast heartbeat from the server holding the data to:
  – Everyone who has a pointer to data on that server
  – Or: everyone who is one or two hops from the server
– Permits the infrastructure to notice:
  – Servers going away for a while
  – Or, going away forever!
Continuous sweep through the data
– Simple calculation of "1 mole" repair: rate-1/4 code, 10 billion servers, 64 fragments, 1 Mbit/sec bandwidth
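A back-of-envelope version of that repair calculation, combining the slide's parameters with the ~10^23-byte data volume from the "mole of bytes" slide; the roughly-a-decade punchline is my arithmetic, offered as a sketch rather than the slide's stated result:

```python
# Rough "1 mole repair" sweep time from the slide's parameters.
data_bytes   = 1e23              # user data (the "mole of bytes" estimate)
stored_bytes = 4 * data_bytes    # rate-1/4 code: 4x storage overhead
servers      = 1e10              # 10 billion servers
bw_bytes_s   = 1e6 / 8           # 1 Mbit/sec each, in bytes/sec

sweep_s = stored_bytes / (servers * bw_bytes_s)
print(f"full sweep: {sweep_s:.2e} s ~= {sweep_s / 3.15e7:.0f} years")  # ~10
```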

Introspective Optimization
Monitoring and adaptation of the routing substrate:
– Optimization of the Plaxton mesh
– Adaptation of the second-tier multicast tree
Continuous monitoring of access patterns:
– Clustering algorithms to discover object relationships
  – Clustered prefetching: demand-fetching related objects
  – Proactive prefetching: get data there before it is needed
– Time-series analysis of user and data motion
Continuous testing and repair of information:
– Slow sweep through all information to make sure there are sufficient erasure-coded fragments
– Continuously reevaluate risk and redistribute data
– Diagnosis and repair of the routing and location infrastructure
– Provide for 1000-year durability of information?

Can You Delete (Eradicate) Data?
Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else! (denial of service)
– Must have an "eradication certificate" or similar
Some answers:
– Bays: limit the scope of data flows
– Ninja Monkeys: hunt and destroy with certificate
Related: revocation of keys
– Need a hunt-and-re-encrypt operation
Related: version pruning
– Temporary files: don't keep versions for long
– Streaming, real-time broadcasts: keep? Maybe
– Locks: keep? No, yes, maybe (auditing!)
– Every keystroke made: keep? For a short while?

First Implementation [Java]: Event-Driven State-Machine Model
Included components:
– Initial floating-replica design
  – Conflict resolution and Byzantine agreement
– Routing facility (Tapestry)
  – Bloom-filter location algorithm
  – Plaxton-based locate-and-route data structures
– Introspective gathering of tacit information and adaptation
  – Language for introspective handler construction
  – Clustering, prefetching, adaptation of network routing
– Initial archival facilities
  – Interleaved Reed-Solomon codes for fragmentation
  – Methods for signing and validating fragments
Target applications:
– Unix file-system interface under Linux ("legacy apps")
– application, proxy for web caches, streaming multimedia applications

OceanStore Conclusions
OceanStore: everyone's data, one big utility
– Global utility model for persistent data storage
OceanStore assumptions:
– Untrusted infrastructure with a responsible party
– Mostly connected, with conflict resolution
– Continuous online optimization
OceanStore properties:
– Provides security, privacy, and integrity
– Provides extreme durability
– Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis, and repair
– Large-scale system has good statistical properties

For More Info
OceanStore vision paper from ASPLOS 2000: "OceanStore: An Architecture for Global-Scale Persistent Storage"
OceanStore web site: