OceanStore: An Architecture for Global-Scale Persistent Storage
Professor John Kubiatowicz, University of California at Berkeley

Overview

OceanStore is a global-scale data utility for Internet services.

How OceanStore is used
- Application/user data is stored in objects
- Objects are placed in the global OceanStore infrastructure
- Objects are accessed via globally unique identifiers (GUIDs)
- Objects are modified via action/predicate pairs (see the update sketch below)
- Each operation creates a new version of the object
- Internet services (applications) define object format and content

Potential Internet services
- Web caches, global file systems, Hotmail-like mail portals, etc.

Goals
- Global scale
- Extreme durability of data
- Use of untrusted infrastructure
- Maintenance-free operation
- Privacy of data
- Automatic performance tuning

Enabling technologies
- Peer-to-peer and overlay networks
- Erasure encoding and replication
- Byzantine agreement
- Repair and automatic node failover
- Encryption and access control
- Introspection and data clustering

Key components: Tapestry and Inner Ring

Tapestry
- Decentralized Object Location and Routing (DOLR)
- Provides routing to an object independent of its location
- Automatically reroutes to backup nodes when failures occur
- Based on the Plaxton algorithm (see the routing sketch below)
- Overlay network that scales to systems with large numbers of nodes
- See the Tapestry poster for more information

Inner Ring
- A set of nodes per object, chosen by the Responsible Party
- Applies the updates/writes requested by users
- Checks all predicates and access control lists
- Byzantine agreement is used to check and serialize updates
  - Based on the algorithm by Castro and Liskov
  - Ensures correctness even with f of 3f+1 nodes compromised
  - Threshold signatures are used to sign the outcome

Key components: Archival Storage and Replicas

Archival Storage
- Provides extreme durability of data objects
- Disseminates archival fragments throughout the infrastructure
- Fragment replication and repair ensure durability
- Utilizes erasure codes: redundancy without the overhead of complete replication (see the durability sketch below)
  - Data objects are coded at a rate r = m/n
  - Coding produces n fragments, of which any m can reconstruct the object
  - Storage overhead is n/m

Replicas
- Full copies of data objects stored in the peer-to-peer infrastructure
- Enable fast access
- Introspection allows replicas to self-organize; replicas migrate towards client accesses
- Encryption of objects ensures data privacy
- A dissemination tree is used to alert replicas of object updates

Pond prototype benchmarks

Update latency
  Key Size   Update Size   Median Time (ms)
  512b       4kB           40
  512b       2MB           1086
  1024b      4kB           99
  1024b      2MB           1150

Latency breakdown
  Phase      Time (ms)
  Check      0.3
  Serialize  6.1
  Apply      1.5
  Archive    4.5
  Sign       77.8

Conclusions and future directions

OceanStore's accomplishments
- Major prototype completed
- Several fully-functional Internet services built and deployed
- Demonstrated the feasibility of the approach
- Published results on the system's performance
- Collaborating with other global-scale research initiatives

Current research directions
- Investigate new introspective data placement strategies
- Finish adding features
  - Tentative update sharing between sessions
  - Archival repair
  - Replica management
- Improve existing performance and deploy to larger networks
  - Examine bottlenecks
  - Improve stability
  - Data structure improvements
- Develop more applications
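The action/predicate update model from the Overview panel can be illustrated in a few lines of code. The following is a minimal sketch in Java (Pond's own implementation language); the class and method names here are hypothetical, and OceanStore's real updates carry richer structure, including access-control checks, than a single predicate/action pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of action/predicate updates: the predicate is
// evaluated against the latest version, and only if it holds is the
// action applied. The result is appended as a NEW version; existing
// versions are never modified, which is what enables "time travel".
class VersionedObject {
    private final List<byte[]> versions = new ArrayList<>();

    VersionedObject(byte[] initial) { versions.add(initial); }

    byte[] latest() { return versions.get(versions.size() - 1); }

    // Returns the new version number, or -1 if the predicate failed.
    int update(Predicate<byte[]> predicate, UnaryOperator<byte[]> action) {
        byte[] head = latest();
        if (!predicate.test(head)) return -1; // reject: predicate does not hold
        versions.add(action.apply(head));     // each successful update = new version
        return versions.size() - 1;
    }
}
```

In the real system it is the inner ring, not the client, that evaluates predicates, serializes concurrent updates through Byzantine agreement, and signs the resulting version.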
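Tapestry's Plaxton-style routing can also be sketched compactly. Below is an illustrative single routing step in Java that resolves one more low-order digit of the destination ID per hop; the fixed ID length, hex digit base, and table layout are assumptions for illustration, and the real Tapestry adds backup links, object location pointers, and rerouting around failures on top of this core.

```java
// Hypothetical single routing step in a Plaxton-style mesh: node IDs are
// fixed-length hex strings, and each hop forwards to a neighbor that
// matches the destination ID in one more trailing digit than we do.
class PlaxtonRouter {
    private final String nodeId;          // e.g. "4A3B...", fixed length
    // routingTable[level][digit] = a neighbor that shares `level` trailing
    // digits with us and has `digit` at position `level` from the right.
    private final String[][] routingTable;

    PlaxtonRouter(String nodeId, String[][] routingTable) {
        this.nodeId = nodeId;
        this.routingTable = routingTable;
    }

    // Number of trailing digits this node shares with dest (same length).
    private int sharedSuffix(String dest) {
        int n = nodeId.length(), k = 0;
        while (k < n && nodeId.charAt(n - 1 - k) == dest.charAt(n - 1 - k)) k++;
        return k;
    }

    // Next hop toward dest, or null if this node is the root for the ID.
    String nextHop(String dest) {
        int level = sharedSuffix(dest);
        if (level == nodeId.length()) return null; // IDs match fully
        int digit = Character.digit(dest.charAt(dest.length() - 1 - level), 16);
        return routingTable[level][digit];         // real Tapestry falls back to backups here
    }
}
```

Because each hop fixes one more digit, a route in a network of N nodes takes O(log N) hops regardless of where the object lives.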
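The erasure-coding figures from the Archival Storage panel can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes fragments are lost independently with probability p, which is an idealization; it compares an object coded at rate r = m/n against full replication at the same n/m storage overhead.

```java
// Back-of-the-envelope durability of an object coded at rate r = m/n:
// the object survives as long as at least m of its n fragments survive.
// Assumes independent fragment loss with probability p (an idealization).
public class ErasureMath {
    // Probability that FEWER than m of n fragments survive (object lost).
    static double lossProbability(int n, int m, double p) {
        double total = 0.0;
        for (int k = 0; k < m; k++) {
            total += binomial(n, k) * Math.pow(1 - p, k) * Math.pow(p, n - k);
        }
        return total;
    }

    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        int m = 16, n = 64;   // rate r = m/n = 1/4, storage overhead n/m = 4x
        double p = 0.1;       // assume each fragment is lost with probability 10%
        System.out.printf("overhead = %.1fx%n", (double) n / m);
        System.out.printf("P(loss) = %.3e  (erasure-coded: any 16 of 64 rebuild)%n",
                          lossProbability(n, m, p));
        System.out.printf("P(loss) = %.3e  (4 full replicas, same 4x overhead)%n",
                          Math.pow(p, 4));
    }
}
```

Under these assumptions the coded object's loss probability is dozens of orders of magnitude below that of four full replicas, even though both cost 4x the object's size in storage.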
Current status: Pond implementation complete

Pond implementation
- All major subsystems completed
  - Fault-tolerant inner ring; erasure-coding archive
- Software released to the developer community outside Berkeley
  - 280K lines of Java, plus JNI libraries for cryptography and the archive
- Several applications implemented
- See the FAST paper on the Pond prototype and its benchmarking

Deployed on PlanetLab
- An initiative to provide researchers with a wide-area testbed
  - ~100 hosts, ~40 sites, multiple continents
- Allows Pond to run up to 1000 virtual nodes
- Applications have been run successfully in the wide area
- Tools created to allow quick deployment to PlanetLab

Internet services built on OceanStore

MINNO
- Global-scale mail service built on OceanStore
- Enables storage of and access to user mail accounts
- Mail is sent via an SMTP proxy and read/organized via IMAP
- MINNO stores data in four types of OceanStore objects: Folder List, Folder, Message, and Maildrop
- A relaxed consistency model enables fast wide-area access

Riptide
- Web caching infrastructure
- Uses data migration to move web objects closer to users
- Verifies the integrity of web content

NFS
- Provides traditional file system support
- Enables time travel (reverting files/directories) through OceanStore's versioning primitives

Many others
- Palm Pilot synchronizer, AFS, etc.

Pond prototype benchmarks (continued)

Object update latency
- Measures the inner ring's Byzantine agreement commit time
- Shows that the threshold signature is costly: ~100 ms latency on object writes

Object update throughput
- Measures object write throughput
- The base system provides 8 MB/s
- Updates must be batched to get good performance (see the batching sketch at the end of this section)

[Figure: the path of an update, from a client through the inner ring out to the replicas and archival storage]

Application benchmarks

NFS: Andrew benchmark
- Client in Berkeley, server in Seattle
- 4.6x slower than NFS in read-intensive phases
- 7.3x slower in write-intensive phases
- Runs in reasonable time with a key size of 512 bits
- Signature time is the bottleneck

MINNO: Login time
- Measures client cache synchronization time, including new message retrieval
- Measured time vs. latency to the inner ring
- Simulates mobile clients
- MINNO adapts well with data migration and tentative commits enabled
- Outperforms a traditional IMAP server, which has no OceanStore processing overhead
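The throughput result above follows from the latency breakdown: the roughly 78 ms threshold signature is paid once per agreement round, so committing many updates per round amortizes it. Below is a minimal sketch of that batching idea, with hypothetical class and method names; Pond's actual batching logic lives in the inner ring and is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: queue updates and run ONE Byzantine agreement
// round (and thus one expensive threshold signature) per batch, rather
// than one round per update, to raise write throughput.
class UpdateBatcher {
    private final int maxBatch;
    private final List<byte[]> pending = new ArrayList<>();

    UpdateBatcher(int maxBatch) { this.maxBatch = maxBatch; }

    // Queue an update; trigger a commit round once the batch fills.
    void submit(byte[] update) {
        pending.add(update);
        if (pending.size() >= maxBatch) flush();
    }

    // Commit whatever is pending in a single round.
    void flush() {
        if (pending.isEmpty()) return;
        commitRound(new ArrayList<>(pending)); // one agreement, one signature
        pending.clear();
    }

    // Stand-in for the inner ring's agree-then-sign step.
    private void commitRound(List<byte[]> batch) {
        System.out.println("committed " + batch.size() + " updates in one round");
    }
}
```

With a fixed per-round signing cost of roughly 78 ms, the per-update overhead falls approximately linearly with batch size until data movement, rather than signing, becomes the bottleneck.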