OceanStore: In Search of Global-Scale, Persistent Storage John Kubiatowicz UC Berkeley
OceanStore: FDIS 2002 ©2002 John Kubiatowicz/UC Berkeley

OceanStore Context: Ubiquitous Computing
Computing everywhere:
  – Desktop, Laptop, Palmtop
  – Cars, Cellphones
  – Shoes? Clothing? Walls?
Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net
  – Broadband to the home and office
  – Wireless technologies such as CDMA, satellite, laser

Utility-based Infrastructure?
[Figure labels: Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore, IBM]
Data service provided by federation of companies
Cross-administrative domain
Metric: MOLE OF BYTES (6 × 10^23)

OceanStore Assumptions
Untrusted Infrastructure:
  – The OceanStore is composed of untrusted components
  – Only ciphertext within the infrastructure
Responsible Party:
  – Some organization (i.e. service provider) guarantees that your data is consistent and durable
  – Not trusted with content of data, merely its integrity
Mostly Well-Connected:
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
Promiscuous Caching:
  – Data may be cached anywhere, anytime

Key Observation: Want Automatic Maintenance
Can’t possibly manage billions of servers by hand!
System should automatically:
  – Adapt to failure
  – Repair itself
  – Incorporate new elements
Introspective Computing/Autonomic Computing
Can data be accessible for 1000 years?
  – New servers added from time to time
  – Old servers removed from time to time
  – Everything just works

Outline
Motivation
Assumptions of the OceanStore
Specific Technologies and approaches:
  – Routing and Data Location
  – Naming
  – Conflict resolution on encrypted data
  – Replication and Deep archival storage
  – Introspection for optimization and repair
Conclusion

Basic Structure: Irregular Mesh of “Pools”

Bringing Order to this Chaos
How do you find information?
  – Must be scalable and provide maximum flexibility
How do you name information?
  – Must provide global uniqueness
How do you ensure consistency?
  – Must scale and handle intermittent connectivity
  – Must prevent unauthorized update of information
How do you protect information?
  – Must preserve privacy
  – Must provide deep archival storage (continuous repair)
How do you tune performance?
  – Locality very important
Throughout all of this: how do you maintain it?

Location and Routing

Locality, Locality, Locality
One of the defining principles
“The ability to exploit local resources over remote ones whenever possible”
“-Centric” approach
  – Client-centric, server-centric, data source-centric
Requirements:
  – Find data quickly, wherever it might reside
  – Locate nearby object without global communication
  – Permit rapid object migration
  – Verifiable: can’t be sidetracked
Locality yields: Performance, Availability, Reliability

Enabling Technology: DOLR (Decentralized Object Location and Routing)
[Figure labels: Tapestry, GUID1, GUID1, GUID2]

Stability under Changes
Unstable, unreliable, untrusted nodes are the common case!
  – Network never fully stabilizes
  – What is half-life of a routing node?
  – Must provide stable routing in these circumstances
Redundancy and adaptation fundamental:
  – Make use of alternative paths when possible
  – Incrementally remove faulty nodes
  – Route around network faults
  – Continuously tune neighbor links

The Tapestry DOLR
Routing to Objects, not Locations!
  – Replacement for IP?
  – Very powerful abstraction
Built as overlay network, but not fundamental
  – Randomized prefix routing + distributed object location index
  – Routing nodes have links to nearby neighbors
  – Additional state tracks objects
Massively parallel insert (SPAA 2002)
  – Construction of nearest-neighbor mesh links
      log² n message complexity for new node
  – New nodes integrated, faulty ones removed
  – Objects kept available during this process

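The prefix-routing idea on this slide can be illustrated with a minimal sketch. It assumes node IDs are equal-length digit strings and that each node knows only a small list of neighbors; the names `shared_prefix_len`, `next_hop`, `route`, and the example IDs are hypothetical. It deliberately ignores what the real Tapestry adds on top: per-digit routing tables, locality-based neighbor choice, surrogate routing, and the object location index.

```python
from typing import Dict, List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def next_hop(current: str, dest: str, neighbors: List[str]) -> Optional[str]:
    """Pick the neighbor that matches `dest` in more leading digits than `current`."""
    have = shared_prefix_len(current, dest)
    better = [n for n in neighbors if shared_prefix_len(n, dest) > have]
    if not better:
        return None  # dead end in this toy model (Tapestry would use a surrogate)
    return max(better, key=lambda n: shared_prefix_len(n, dest))


def route(start: str, dest: str, links: Dict[str, List[str]]) -> List[str]:
    """Follow next_hop until the destination is reached or no progress is possible."""
    path = [start]
    node = start
    while node != dest:
        hop = next_hop(node, dest, links.get(node, []))
        if hop is None:
            break
        path.append(hop)
        node = hop
    return path


# Hypothetical four-digit hex IDs; each hop fixes at least one more digit.
links = {
    "0325": ["4598", "1111"],
    "4598": ["42E0", "4A6D"],
    "42E0": ["42A9"],
    "42A9": ["42AD"],
}
print(route("0325", "42AD", links))
```

Because every hop extends the matched prefix by at least one digit, the path length is bounded by the number of digits in an ID, which is where the logarithmic hop count of this family of schemes comes from.
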
OceanStore Naming

Model of Data
Ubiquitous object access from anywhere
  – Undifferentiated “Bag of Bits”
Versioned Objects
  – Every update generates a new version
  – Can always go back in time (Time Travel)
Each Version is Read-Only
  – Can have permanent name (SHA-1 Hash)
  – Much easier to repair
An Object is a signed mapping between permanent name and latest version
  – Write access control/integrity involves managing these mappings
[Figure: Comet Analogy; labels: updates, versions]

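A minimal sketch of this data model, assuming an in-memory store: every update produces a new read-only version named by its SHA-1 hash, and the object is just a pointer to the latest version. The class name `VersionedObject` and its methods are hypothetical, and the signature on the name-to-latest-version mapping is omitted.

```python
import hashlib
from typing import Dict, List


class VersionedObject:
    """Toy versioned object: versions are immutable and named by SHA-1 hash."""

    def __init__(self) -> None:
        self.versions: Dict[str, bytes] = {}  # permanent name -> version content
        self.history: List[str] = []          # permanent names, oldest first

    def update(self, data: bytes) -> str:
        """Store a new read-only version and make it the latest one."""
        name = hashlib.sha1(data).hexdigest()
        self.versions.setdefault(name, data)
        self.history.append(name)
        return name

    def latest(self) -> str:
        """Permanent name of the most recent version."""
        return self.history[-1]

    def read(self, name: str) -> bytes:
        """Read any version, old or new, by its permanent name ("time travel")."""
        return self.versions[name]


obj = VersionedObject()
v1 = obj.update(b"draft")
v2 = obj.update(b"final")
print(obj.latest() == v2, obj.read(v1))
```

Since old versions are never overwritten, repairing a damaged copy reduces to fetching any other copy whose hash matches the permanent name.
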
Secure Hashing
Read-only data: GUID is hash over actual information
  – Uniqueness and Unforgeability: the data is what it is!
  – Verification: check hash over data
Changeable data: GUID is combined hash over a human-readable name + public key
  – Uniqueness: GUID space selected by public key
  – Unforgeability: public key is indelibly bound to GUID
  – Verification: check signatures with public key
[Figure: DATA -> SHA-1 -> 160-bit GUID]

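Both kinds of GUID can be sketched with Python's standard `hashlib`. The function names are hypothetical, and the way the name and public key are combined (a length-prefixed concatenation) is an assumption made for illustration; the slides do not specify the encoding. Signature checking for changeable data is not shown.

```python
import hashlib


def readonly_guid(data: bytes) -> bytes:
    """GUID of read-only data: the 160-bit SHA-1 hash of the content itself."""
    return hashlib.sha1(data).digest()


def verify_readonly(guid: bytes, data: bytes) -> bool:
    """Anyone holding the GUID can check that the returned data is genuine."""
    return hashlib.sha1(data).digest() == guid


def object_guid(name: str, public_key: bytes) -> bytes:
    """GUID of changeable data: hash over a human-readable name plus a public key.

    Assumption: the name is length-prefixed so that (name, key) pairs cannot
    collide by shifting bytes between the two fields.
    """
    name_bytes = name.encode("utf-8")
    h = hashlib.sha1()
    h.update(len(name_bytes).to_bytes(4, "big"))
    h.update(name_bytes)
    h.update(public_key)
    return h.digest()


block = b"some archived block"
guid = readonly_guid(block)
print(len(guid) * 8, verify_readonly(guid, block), verify_readonly(guid, b"tampered"))
```

Two owners using the same human-readable name get different GUIDs because their public keys differ, which is the sense in which the key selects the GUID space.
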
Secure Naming
Naming hierarchy:
  – Users map from names to GUIDs via hierarchy of OceanStore objects (à la SDSI)
  – Requires set of “root keys” to be acquired by user
[Figure labels: Foo, Bar, Baz, Myfile, Out-of-Band “Root link”]

The Write Path

The Path of an OceanStore Update
[Figure labels: Second-Tier Caches, Multicast trees, Inner-Ring Servers, Clients]

OceanStore Consistency via Conflict Resolution
Consistency is a form of optimistic concurrency
  – An update packet contains a series of predicate-action pairs which operate on encrypted data
  – Each predicate is tried in turn:
      If none match, the update is aborted
      Otherwise, the action of the first true predicate is applied
Inner Ring must securely:
  – Pick serial order of updates
  – Apply them
  – Sign result (threshold signature)
  – Disseminate results to active users

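The predicate-action rule can be sketched as follows. This is a toy model under stated assumptions: the object state is an opaque byte string (standing in for ciphertext), a predicate only compares a hash of that state, and an action only appends bytes. The names `apply_update`, `version_is`, and `append` are hypothetical; serialization by the inner ring and signing are not modeled.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

Predicate = Callable[[bytes], bool]
Action = Callable[[bytes], bytes]


def apply_update(state: bytes, pairs: List[Tuple[Predicate, Action]]) -> Optional[bytes]:
    """Apply the action of the first predicate that holds; None means aborted."""
    for predicate, action in pairs:
        if predicate(state):
            return action(state)
    return None


def version_is(expected_sha1_hex: str) -> Predicate:
    """Predicate: the current state hashes to the version the client last saw."""
    return lambda state: hashlib.sha1(state).hexdigest() == expected_sha1_hex


def append(suffix: bytes) -> Action:
    """Action: append opaque bytes to the state."""
    return lambda state: state + suffix


seen = hashlib.sha1(b"v1").hexdigest()
update = [(version_is(seen), append(b"+edit"))]

print(apply_update(b"v1", update))  # the client's view is current, so the action runs
print(apply_update(b"v2", update))  # the state moved on, so the update aborts
```

Neither the predicate nor the action needs to understand the content, which is why checks of this kind remain possible when the servers hold only ciphertext.
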
Automatic Maintenance
Byzantine Commitment for inner ring:
  – Tolerates fewer than 1/3 malicious servers in inner ring (f faulty servers out of n ≥ 3f + 1)
  – Continuous refresh of set of inner-ring servers
      Proactive threshold signatures
      Use of Tapestry: membership of inner ring unknown to clients
Secondary tier self-organized into overlay dissemination tree
  – Use of Tapestry routing to suggest placement of replicas in the infrastructure
  – Automatic choice between update vs invalidate

Self-Organizing Soft-State Replication
Simple algorithms for placing replicas on nodes in the interior
  – Intuition: locality properties of Tapestry help select positions for replicas
  – Tapestry helps associate parents and children to build multicast tree
Preliminary results show that this is effective

Deep Archival Storage

Two Types of OceanStore Data
Active Data: “Floating Replicas”
  – Per-object virtual server
  – Logging for updates/conflict resolution
  – Interaction with other replicas for consistency
  – May appear and disappear like bubbles
Archival Data: OceanStore’s Stable Store
  – m-of-n coding: like a hologram
      Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
      Coding overhead is proportional to n/m (e.g., 4)
      Other parameter, rate, is 1/overhead
  – Fragments are cryptographically self-verifying
Most data in the OceanStore is archival!

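To make the m-of-n property concrete, here is a toy erasure code for a single block of m bytes. It is a Reed-Solomon-style polynomial code over the prime field GF(257), chosen for brevity; it is a sketch, not the coding scheme used by OceanStore. The names `encode`, `decode`, and `_lagrange_eval` are hypothetical, fragment values are kept as integers (they can equal 256), and the code needs Python 3.8+ for modular inverses via `pow`.

```python
from typing import List, Tuple

P = 257  # prime just above the byte range, so every byte is a field element

Fragment = Tuple[int, int]  # (evaluation point x, polynomial value y)


def _lagrange_eval(points: List[Fragment], x: int) -> int:
    """Evaluate at x the unique polynomial through `points`, working mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int) -> List[Fragment]:
    """Code m = len(data) bytes into n fragments (requires m <= n < P)."""
    points = list(enumerate(data, start=1))  # data byte k sits at x = k + 1
    return [(x, _lagrange_eval(points, x)) for x in range(1, n + 1)]


def decode(fragments: List[Fragment], m: int) -> bytes:
    """Rebuild the original m bytes from any m distinct fragments."""
    points = fragments[:m]
    return bytes(_lagrange_eval(points, x) for x in range(1, m + 1))


data = b"OceanStore data!"            # m = 16 bytes
fragments = encode(data, 64)          # n = 64 fragments
survivors = fragments[48:]            # any 16 of the 64 will do
print(decode(survivors, 16) == data, "overhead:", 64 / 16, "rate:", 16 / 64)
```

With m=16 and n=64 the stored data is n/m = 4 times the original size, and the rate is m/n = 1/4, matching the example figures on the slide.
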
Archival Dissemination of Fragments

Fraction of Blocks Lost per Year (FBLPY)
Exploit law of large numbers for durability!
6 month repair, FBLPY:
  – Replication: 0.03
  – Fragmentation:

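The law-of-large-numbers argument can be illustrated with a simple binomial model. This sketch assumes that each copy or fragment fails independently with the same probability within one repair period; the function names and the failure probability used below are hypothetical. It does not reproduce the figures on the slide, whose model parameters are not given here, and it yields a per-period loss probability rather than a per-year fraction.

```python
from math import comb


def replication_loss(p_fail: float, replicas: int) -> float:
    """A replicated block is lost only if every replica fails in the period."""
    return p_fail ** replicas


def fragmentation_loss(p_fail: float, m: int, n: int) -> float:
    """An m-of-n coded block is lost if fewer than m of its n fragments survive."""
    p_ok = 1.0 - p_fail
    return sum(comb(n, k) * p_ok ** k * p_fail ** (n - k) for k in range(m))


# Same storage overhead (4x) for both schemes; p_fail is an illustrative value.
p_fail = 0.1
print("replication, 4 copies:   ", replication_loss(p_fail, 4))
print("fragmentation, 16-of-64: ", fragmentation_loss(p_fail, 16, 64))
```

Replication is the 1-of-n special case of the same formula. Spreading the same overhead across many fragments makes loss require a large number of simultaneous failures, which is why the coded scheme comes out far more durable in this model.
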
The Dissemination Process: Achieving Failure Independence
[Figure labels: Model Builder, Set Creator, Introspection, Human Input, Network Monitoring, model, Inner Ring, set, probe, type, fragments]

Automatic Maintenance
Continuous Entropy Suppression, i.e. repair!
  – Erasure coding gives flexibility in timing repair
Data continuously transferred from physical medium to physical medium
  – No “tapes decaying in basement”
Actual Repair
  – Recombine fragments, then send out copies again
  – DOLR permits efficient heartbeat mechanism
Permits infrastructure to notice:
  – Servers going away for a while
  – Or, going away forever!
  – Continuous sweep through data

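One pass of such a repair sweep might look like the sketch below. It assumes that a heartbeat mechanism has already produced a set of live servers and that the location of every fragment is known; the function `sweep`, its parameters, and the server names are hypothetical. The actual repair step (recombining fragments and sending out new ones) is not shown.

```python
from typing import Dict, List, Set, Tuple


def sweep(
    fragment_hosts: Dict[str, List[str]],
    live_servers: Set[str],
    m: int,
    repair_below: int,
) -> Tuple[List[str], List[str]]:
    """Classify blocks after a heartbeat round.

    Returns (blocks to repair, blocks already lost). A block is lost when fewer
    than m fragments are reachable; it is queued for repair when it is still
    recoverable but has fewer than `repair_below` reachable fragments.
    """
    to_repair: List[str] = []
    lost: List[str] = []
    for block, hosts in sorted(fragment_hosts.items()):
        alive = sum(1 for host in hosts if host in live_servers)
        if alive < m:
            lost.append(block)
        elif alive < repair_below:
            to_repair.append(block)
    return to_repair, lost


# Hypothetical layout: three blocks, four fragments each, any two reconstruct.
fragment_hosts = {
    "a": ["s1", "s2", "s3", "s4"],
    "b": ["s1", "s5", "s6", "s7"],
    "c": ["s5", "s6", "s7", "s8"],
}
live = {"s1", "s2", "s3", "s5"}
print(sweep(fragment_hosts, live, m=2, repair_below=4))
```

The gap between `m` and `repair_below` is the flexibility in timing that erasure coding provides: a block can lose several fragments before repair becomes urgent.
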
Introspective Tuning

On the use of Redundancy
Question: Can we use Moore’s law gains for something other than just raw performance?
  – Growth in computational performance
  – Growth in network bandwidth
  – Growth in storage capacity
Physical systems are unreliable and untrusted
  – Can we use multiple faulty elements instead of one?
  – Can we devote resources to monitoring and analysis?
  – Can we devote resources to repairing systems?
Complexity of systems growing rapidly
  – Can no longer debug systems entirely
  – How to handle this?

The Biological Inspiration
Biological Systems are built from (extremely) faulty components, yet:
  – They operate with a variety of component failures
      Redundancy of function and representation
  – They have stable behavior
      Negative feedback
  – They are self-tuning
      Optimization of common case
Introspective Computing:
  – Components for computing
  – Components for monitoring and model building
  – Components for continuous adaptation
[Figure labels: Adapt, Compute, Monitor]

The Thermodynamic Analogy
System such as OceanStore has a variety of latent order
  – Connections between elements
  – Mathematical structure (erasure coding, etc.)
  – Distributions peaked about some desired behavior
Permits “Stability through Statistics”
  – Exploit the behavior of aggregates
Subject to Entropy
  – Servers fail, attacks happen, system changes
Requires continuous repair
  – Apply energy (i.e. through servers) to reduce entropy

Introspective Optimization
Adaptation of routing substrate
  – Optimization of Tapestry Mesh
  – Fault-tolerant routing mechanisms
  – Adaptation of second-tier multicast tree
Monitoring of access patterns:
  – Clustering algorithms to discover object relationships
  – Time-series analysis of user and data motion
Observations of system behavior
  – Extraction of failure correlations
Continuous testing and repair of information
  – Slow sweep through all information to make sure there are sufficient erasure-coded fragments
  – Continuously reevaluate risk and redistribute data

PondStore [Java]: Event-driven state-machine model
Included Components
  – Initial floating replica design
      Conflict resolution and Byzantine agreement
  – Routing facility (Tapestry)
      Bloom Filter location algorithm
      Plaxton-based locate and route data structures
  – Introspective gathering of tacit info and adaptation
      Language for introspective handler construction
      Clustering, prefetching, adaptation of network routing
  – Initial archival facilities
      Interleaved Reed-Solomon codes for fragmentation
      Methods for signing and validating fragments
Target Applications
  – Unix file-system interface under Linux (“legacy apps”) application, proxy for web caches, streaming multimedia applications

We have Things Running!
Latest: it is up to 7 MB/sec
Still a ways to go, but working

Update Latency
Cryptography in critical path (not surprising!)
New metric: Avoid hashes (like avoid copies)

OceanStore Goes Global!
OceanStore components running “globally”:
  – Australia, Georgia, Washington, Texas, Boston
  – Able to run the Andrew File-System benchmark with inner ring spread throughout US
  – Interface: NFS on OceanStore
Word on the street: it was easy to do
  – The components were debugged locally
  – Easily set up remotely
I am currently talking with people in:
  – England, Maryland, Minnesota, …
  – PlanetLab testbed will give us access to much more

Reality: Web Caching through OceanStore

Other Apps
Better file system support
  – NFS (working; reimplementation in progress)
  – Windows Installable file system (soon)
Email through OceanStore
  – IMAP and POP proxies
  – Let normal mail clients access mailboxes in OceanStore
Palm-pilot synchronization
  – Palm database as an OceanStore DB

OceanStore Conclusions
OceanStore: everyone’s data, one big utility
  – Global Utility model for persistent data storage
OceanStore assumptions:
  – Untrusted infrastructure with a responsible party
  – Mostly connected with conflict resolution
  – Continuous on-line optimization
OceanStore properties:
  – Provides security, privacy, and integrity
  – Provides extreme durability
  – Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair
  – Large scale system has good statistical properties

For more info:
OceanStore vision paper for ASPLOS 2000: “OceanStore: An Architecture for Global-Scale Persistent Storage”
Tapestry algorithms paper (SPAA 2002): “Distributed Object Location in a Dynamic Network”
Bloom Filters for Probabilistic Routing (INFOCOM 2002): “Probabilistic Location and Routing”
OceanStore web site: