Presentation is loading. Please wait.

Presentation is loading. Please wait.

OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels,

Similar presentations


Presentation on theme: "OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels,"— Presentation transcript:

1 OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao A few slides are taken from John Kubiatowicz’s presentation

2 Vision what is Oceanstore? “a utility infrastructure to span the globe and provide continuous access to persistent information” Source: Berkeley OceanStore Website

3 Where does the data come from All kinds of information Desktop, laptop, palmtop Cell phones, embedded devices

4 Persistence of data Data should be affected even if device is lost or replaced reliable, durable data “deep archival” (will last forever) Automatic maintenance

5 Continuous access to data Connectivity even to tiniest devices, possibly intermittent variable bandwidth, latency Availability Highly available, replicated in a global scale comparable to LAN-based networked storage fault-tolerant, DoS-tolerant

6 How much data? scale geographically distributed Assume 10 10 users Each use has 10,000 files or objects 10 14 files / objects Each file = 1 MB 10 20 bytes = 100 Exabytes

7 Data service Economics Pay monthly fee to ISPs / data service providers Essentially, you use others’ storage capacity via subscription Cloud computing was not introduced then …

8 Assumptions Untrusted infrastructure Servers may crash or leak information, but Most of the servers functioning correctly “Financially responsible” servers ensure integrity but only clients are trusted with cleartext Nomadic data Data divorced from location, but flows freely within the storage infrastructure Promiscuous caching: “anywhere, anytime” Proximity of location important for performance

9 System overview persistent object Globally Unique Identifier GUID for each object: 160-bit SHA-1 hash secure identification – globally unique and unforgeable 2 80 unique objects before collisions (birthday paradox) Encrypted data, unless data is public read try fast probabilistic replica search (Bloom filter) fallback to slower deterministic search (Tapestry) [Tapestry is a practical P2P network that uses Plaxton routing]

10 System overview Write Update with predicates [as in Bayou] – [what is Bayou?] Update creates new version, in principle need to group updates, retire objects [Think of a shared calendar]

11 What is Bayou The Bayou System is a platform of replicated, highly-available, databases on which to build collaborative applications. Supports weak consistency models

12 System overview application interface sessions: sequence of read/writes session guarantees [Bayou] weak consistency levels, ACID Unix File Sharing semantics active and archival forms active: latest version, with update handle archive: “erasure coded” read-only version

13 Comparison with Bayou Similarities update with predicates [Example: compare version: Check if this version number is greater than X] creates new version, in principle Uses replication to improve availability at the expense of consistency Differences anti-entropy vs. promiscuous caching Encryption (unlike Bayou) Designated servers Anytime, anywhere

14 naming self-certifying path names (Mazières) (a mechanism that avoids central naming & certification & key distribution) object GUID = hash of owner key and readable name other objects server GUID = hash of public key archival GUID = hash of data read restriction (through client encryption of data) write restriction (associate ACL lists with object, respected by servers

15 addressing and routing Address an object by its GUID message: GUID, small predicate route to closest GUID replica matching predicate combines data location and routing: no central name service to attack

16 addressing and routing fast, probabilistic search algorithm Bloom filter probabilistic set membership test using bit vector n-bit vector generated from n hashes of each set element filter is union (OR) of all bit vectors attenuated Bloom filter array of d Bloom filters i th Bloom filter is union of all <i -hop nodes slow, deterministic algorithm Tapestry

17 Bloom Filter Quickly tests if an element w belongs to a set S (Taken from Wikipedia) For each element of the set S, compute K hash functions, and enter them into the corresponding slots of the array. There is a very small risk of false positive. Note that w does not belong to the set S K=3

18 Attenuated Bloom Filter Quickly tests if an object w is stored in a site S, and if not, then in which direction to route the query It gives a hint of how far away a match can possibly be found W

19 addressing and routing probabilistic deterministic

20 Attenuated Bloom Filter

21 21 Tapestry uses Plaxton Routing Using the routing tables, the query is forwarded towards the root node of the object. The root points to the server storing the object. Extra copies may be discovered earlier as the query moves towards the root. Root of O (O,S’,2) (O,S,1) (O,S,2) (O,S’,1) (O,S) Server S (O,S’,3) Server S’

22 updates based on versioning and conflict resolution i.e. no locking update: actions with predicates commit – apply action of first true predicate abort – no true predicates conflict resolution on encrypted data possible predicates: compare-version, compare-size, compare-block, search possible actions: replace-block, insert-block, delete-block, append

23 Update on ciphertext

24 updates serializing updates will not trust any single server Byzantine agreement among primary tier servers secondary tier gossips tentative data during commit multicast dissemination of commit to secondary tier primary secondary

25 archival produced when objects idle use erasure codes (redundant fragmentation) simplest example: parity bit need any (n-1) out of n fragments Reed-Solomon codes fragmentation improves reliability File Any 4 can Reconstruct the block Any 4 can Reconstruct the block

26 dynamic optimization (introspection) observation modules collect and summarize information incrementally update system database and optimization modules periodically process the observation database replica management: maintain replica count and location periodic migration: work-home-work-home… etc


Download ppt "OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels,"

Similar presentations


Ads by Google