Pond: the OceanStore Prototype
  Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz
  UC Berkeley
  File and Storage Technologies (FAST), March 2003
  Presenter: Prashanth
Goals of OceanStore
  - Provide an Internet-scale cooperative file system
  - High durability
  - Universal availability
  - Balance between privacy and information sharing
  - Integrity
Challenges
  - Maintenance
    - Many components, many administrative domains
    - Constant change
    - Must be self-organizing
    - Must be self-maintaining
  - Security
    - Must have end-to-end encryption
    - Must not place too much trust in any one host
Assumptions
  - Infrastructure is untrusted except in aggregate
    - No more than some fraction of a given set of servers is faulty or malicious
  - Infrastructure is constantly changing
OceanStore uses Tapestry
  - Tapestry performs Distributed Object Location and Routing (DOLR)
  - Locality-aware
  - Efficient: O(log N) location time in a network of N nodes
  - Self-organizing, self-maintaining
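The O(log N) bound can be illustrated with digit-by-digit prefix matching, the general style of routing used by Tapestry-like overlays. The sketch below is a simplification under stated assumptions: every function name is hypothetical, each hop is chosen with global knowledge of all node IDs instead of a per-node routing table, and locality is not modelled.

```python
from itertools import product
from typing import List


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two equal-length IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(source: str, dest: str, nodes: List[str]) -> List[str]:
    """Return the hop-by-hop path; every hop matches at least one more digit of dest."""
    path = [source]
    current = source
    while current != dest:
        matched = shared_prefix_len(current, dest)
        # Assumption: global knowledge stands in for a real routing table.
        candidates = [n for n in nodes if shared_prefix_len(n, dest) > matched]
        # Take the smallest improvement so the digit-by-digit progress is visible.
        current = min(candidates, key=lambda n: (shared_prefix_len(n, dest), n))
        path.append(current)
    return path


# A full namespace of 2-digit base-4 IDs: N = 16 nodes, log_4(16) = 2 digits.
NODES = ["".join(digits) for digits in product("0123", repeat=2)]
```

Because each hop fixes at least one more digit of the destination ID, a route never needs more hops than the ID has digits, which is log_b N for a full namespace in base b.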
Data Model of OceanStore
  - The unit of storage is the data object
    - Analogous to a file in a file system
  - Each data object is an ordered sequence of read-only versions
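A minimal sketch of this data model, assuming an in-memory list stands in for the storage layer. The class and method names (`Version`, `DataObject`, `update`, `latest`, `read`) are hypothetical and are not taken from Pond.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Version:
    """One read-only version of a data object (frozen, so it cannot be mutated)."""
    number: int
    content: bytes


@dataclass
class DataObject:
    """A data object modelled as an ordered sequence of immutable versions."""
    versions: List[Version] = field(default_factory=list)

    def update(self, content: bytes) -> Version:
        # An update never overwrites existing data; it appends a new version.
        version = Version(number=len(self.versions), content=content)
        self.versions.append(version)
        return version

    def latest(self) -> Version:
        return self.versions[-1]

    def read(self, number: int) -> Version:
        # Older versions stay readable after later updates.
        return self.versions[number]


obj = DataObject()
obj.update(b"draft")
obj.update(b"final")
```

Here `obj.read(0)` still returns the first version after the second update, which is the property the slide describes.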
Byzantine Agreement
  - Guarantees that all non-faulty replicas agree
    - Given N = 3f + 1 replicas, up to f may be faulty or corrupt
  - Expensive
    - Requires O(N^2) communication
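The arithmetic behind these two bullets can be sketched as follows. The function names are hypothetical, and the message count assumes a single all-to-all round, which is only meant to show where the quadratic growth comes from.

```python
def replicas_needed(f: int) -> int:
    """Replicas required to tolerate f faulty ones: N = 3f + 1."""
    return 3 * f + 1


def max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    return (n - 1) // 3


def all_to_all_messages(n: int) -> int:
    """Messages in one round where each replica sends to the other n - 1."""
    return n * (n - 1)
```

For example, tolerating one fault takes 4 replicas and 12 messages per all-to-all round, while tolerating two faults takes 7 replicas and 42 messages.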
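Erasure Codes
  [Figure: an encoding function f maps a block to fragments W, X, Y, Z; the inverse f^-1 recovers the block from a subset of the fragments]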
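The idea of recovering a block from a subset of its fragments can be shown with the simplest possible erasure code: two data fragments plus one XOR parity fragment, where any two of the three suffice. This is a deliberately swapped-in toy scheme, not the code Pond uses, and the fragment names A, B, and P are hypothetical.

```python
from typing import Dict


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes) -> Dict[str, bytes]:
    """Split an even-length block into halves A and B and add parity P = A xor B."""
    if len(block) % 2:
        raise ValueError("block length must be even in this sketch")
    half = len(block) // 2
    a, b = block[:half], block[half:]
    return {"A": a, "B": b, "P": xor_bytes(a, b)}


def decode(fragments: Dict[str, bytes]) -> bytes:
    """Rebuild the block from any two of the three fragments."""
    if len(fragments) < 2:
        raise ValueError("need at least two fragments")
    if "A" in fragments and "B" in fragments:
        a, b = fragments["A"], fragments["B"]
    elif "A" in fragments:
        a = fragments["A"]
        b = xor_bytes(a, fragments["P"])  # B = A xor P
    else:
        b = fragments["B"]
        a = xor_bytes(b, fragments["P"])  # A = B xor P
    return a + b
```

Losing any single fragment is survivable at a storage overhead of 1.5x, compared with 2x for keeping a full second copy.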
The Path of a Write
  [Figure labels: Primary Replicas; HotOS Attendee; Other Researchers; Archival Servers (for durability); Secondary Replicas (soft state)]
The Prototype: Pond
  - Written in Java
  - Built on the Staged Event-Driven Architecture (SEDA)
Performance Results: Andrew Benchmark
  - Ran Andrew on Pond
    - Primary replicas at UCB, UW, Stanford, Intel Berkeley
    - Client at UCB
  - Pond faster than NFS on reads: 4.6x
    - Phases III and IV
    - Only contact primary when cache older than 30 seconds
  - But slower on writes: 7.3x
    - Phases I, II, and V
    - Only 1024-bit keys are secure
    - 512-bit keys show CPU cost
  [Table: times in seconds for NFS and OceanStore, by phase (I, II, III, IV, V) and in total]
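The read-path rule "only contact primary when cache older than 30 seconds" can be sketched as a time-bounded cache. The class name, the `fetch_from_primary` callback, and the injectable clock are all hypothetical; only the 30-second bound comes from the slide.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 30.0  # staleness bound quoted on the slide


class CachedReader:
    """Serve reads from a local copy; refresh from the primary only when stale."""

    def __init__(self, fetch_from_primary: Callable[[], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_primary  # hypothetical call to the primary replicas
        self._clock = clock               # injectable so the behaviour can be tested
        self._value: Optional[bytes] = None
        self._fetched_at: Optional[float] = None

    def read(self) -> Optional[bytes]:
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at > CACHE_TTL_SECONDS)
        if stale:
            # Cache is empty or older than 30 seconds: go to the primary.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Reads that arrive within the 30-second window never leave the client, which is consistent with the read-heavy phases being the ones where Pond does well.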
A Closer Look at Writes
  - Small writes
    - Signature dominates
    - Threshold signatures are slow: 70+ ms to sign, compared to 5 ms for a regular signature
  - Large writes
    - Encoding dominates
  - Archive cost is per byte
  - Signature cost is per write
  [Table: time in milliseconds for a 4 kB write and a 2 MB write, by phase: Validate, Serialize, Apply, Archive, Sign Result]
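The per-write and per-byte costs can be read as a simple linear model. This is an interpretation and not a formula from the talk: the symbols c_sign (fixed signature cost per write), c_arch (archive cost per byte), and s (write size in bytes) are introduced here for illustration, and the other phases are ignored.

```latex
T_{\text{write}}(s) \approx c_{\text{sign}} + c_{\text{arch}} \cdot s,
\qquad
s^{*} = \frac{c_{\text{sign}}}{c_{\text{arch}}}
```

Under this model the fixed signature term is the larger one for writes smaller than s*, and the per-byte archive term is the larger one for writes bigger than s*, which matches the small-write and large-write observations.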
Performance of Write
Stream Benchmark
Sources
  - The OceanStore Project
Thank you!