
1 CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures Andreas Haeberlen, Alan Mislove, and Peter Druschel Presenter: Yi Qiao

2 Outline Introduction Related Work Assumptions and Intended Environment Glacier Object Aggregation Security Evaluation Conclusions

3 Introduction How to achieve high availability in decentralized storage systems? Replication Problems – Failures are not independent – Worms make the problem worse – Losing even some data can have catastrophic effects Glacier – A distributed storage system that is robust to large-scale correlated failures Highly durable, decentralized storage – Trades storage efficiency for durability – Makes minimal assumptions about the nature and correlation of failures – Aggregates small objects and uses a fragment maintenance protocol to reduce message overhead

4 Related Work OceanStore and Phoenix – Apply introspection to defend against correlated failures: it is difficult to capture all correlations, and introspection itself can make the system vulnerable to attacks – Glacier instead relies on minimal assumptions about the nature of failures, at the cost of a larger storage overhead TotalRecall – Optimizes availability under churn, but offers no worst-case guarantees PAST, Farsite – Use replication to guard against data loss Weatherspoon et al. – Show that erasure codes can achieve a better MTTF than plain replication

5 Assumptions and Intended Environment Intended for an environment of desktop computers within an organizational intranet – Some fraction of the nodes can be home desktops connected via DSL or wireless LAN – Modest churn and good network connectivity – Used in combination with a conventional decentralized replicated storage layer Node lifetime – hundreds of days; session time – hours to days Three modes of operation – Normal operation – Large-scale failure – up to a fraction fmax of the nodes fail; Glacier protects the data stored on the non-faulty nodes – Recovery mode – reconstitutes aggregates and restores missing fragments

6 Glacier Participating storage nodes form an overlay network – The set of keys forms a circular space – Each node stores the objects whose keys fall into its own key segment – Uses an underlying DHT layer for secure routing and communication Operates alongside a primary store that keeps full replicas Aggregates small objects Erasure-codes the aggregates Places fragments on randomly selected nodes

7 Glacier Durability guarantee – If at most a fraction f ≤ fmax of the nodes fail, each object survives with probability P ≥ Pmin Application interface (a usage sketch follows) – put(i,v,o,l) – get(i,v) → o – refresh(i,v,l) – No primitives for deletion or overwriting; leases are used instead and can be renewed when necessary
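
A minimal sketch of this interface as a client might use it; the in-memory GlacierStore class and its internals are assumptions for illustration only, with just the three primitives coming from the slide:

    import time

    class GlacierStore:
        """Hypothetical in-memory stand-in for a Glacier node (illustration only)."""
        def __init__(self):
            self._objects = {}  # (i, v) -> (object, lease expiry time)

        def put(self, i, v, o, l):
            # Insert object o under key i and version v, with lease time l (seconds).
            self._objects[(i, v)] = (o, time.time() + l)

        def get(self, i, v):
            # Return the object stored under (i, v), or None if absent.
            entry = self._objects.get((i, v))
            return entry[0] if entry else None

        def refresh(self, i, v, l):
            # No delete or overwrite: data persists until its lease expires,
            # and refresh simply extends the lease.
            o, _ = self._objects[(i, v)]
            self._objects[(i, v)] = (o, time.time() + l)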

8 Glacier Fragments and manifests – Erasure coding reduces the storage overhead: an object O of size |O| is stored as n fragments F1, F2, ..., Fn of size |O|/r each, any r of which suffice to restore the object – Object key k; fragment Fi is stored under key (k,i,v) – Object authenticator and manifest: A_O = (H(O), H(F1), H(F2), ..., H(Fn), v, l), so corrupted fragments can be detected and removed (sketch below) Key ownership – Keys are assigned by consistent hashing over the set of nodes that are either online or were online within a period Tmax
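
A sketch of how the manifest could be built and checked, assuming SHA-1 for the hash H (the helper names are illustrative, and the erasure code itself is elided):

    import hashlib

    def H(data: bytes) -> bytes:
        # Stand-in for the hash function H; SHA-1 is an assumption here.
        return hashlib.sha1(data).digest()

    def make_manifest(obj: bytes, fragments: list, v: int, l: int) -> tuple:
        # Object authenticator A_O = (H(O), H(F1), ..., H(Fn), v, l).
        return (H(obj), [H(f) for f in fragments], v, l)

    def fragment_valid(manifest: tuple, i: int, fragment: bytes) -> bool:
        # A node holding only the manifest can detect, and then remove,
        # a corrupted fragment by checking its hash (i is 1-based as in
        # the slide's notation).
        return H(fragment) == manifest[1][i - 1]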

9 Glacier Fragment placement – Fragments of the same object are placed on different, randomly chosen nodes – Fragments of objects with similar keys should be grouped together – The placement function should be stable – P(k,i,v) = k + i/(n+1) + H(v): the primary replica sits at position k, the n fragments at equidistant points in the circular space, and H(v) prevents load imbalance (sketch below) – When a new object (k,v) is inserted and the owner of P(k,i,v) is offline, the fragment is discarded and restored later
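
A sketch of the placement function on a key space normalized to [0, 1); the formula is from the slide, while the mapping of H(v) into [0, 1) is an assumption:

    import hashlib

    def Hv(v) -> float:
        # Assumed mapping of the version hash H(v) into [0, 1).
        digest = hashlib.sha1(str(v).encode()).digest()
        return int.from_bytes(digest[:8], 'big') / 2**64

    def placement(k: float, i: int, v, n: int) -> float:
        # P(k, i, v) = k + i/(n+1) + H(v), modulo the circular key space.
        # The i/(n+1) term yields equidistant points for the n fragments,
        # and H(v) shifts each version to prevent load imbalance.
        return (k + i / (n + 1) + Hv(v)) % 1.0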

10 Glacier Fragment maintenance – Missed fragment insertions, key-space ownership changes, and failures can cause fragments to be lost – A simple protocol (sketched below): the node compiles a list of all keys (k,v) in its local fragment store and sends the list to some of its peers; each peer replies with a list of manifests for the objects the node is missing; the node then requests r fragments from its peers, validates them, and computes the fragment to store locally
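
A sketch of one round of this protocol; the node and peer method names are assumptions standing in for DHT messages, and fragment_valid is reused from the manifest sketch above:

    def maintenance_round(node, peers, r):
        # Step 1: advertise the (k, v) keys held in the local fragment store.
        local_keys = set(node.fragment_store.keys())
        for peer in peers:
            # Step 2: the peer answers with manifests of objects whose
            # fragments this node should hold but does not.
            for manifest in peer.report_missing(local_keys):
                # Step 3: fetch any r fragments (as (index, fragment) pairs),
                # validate them against the manifest, and recompute the
                # fragment to store locally.
                frags = node.fetch_any_fragments(manifest, r)
                if all(fragment_valid(manifest, i, f) for i, f in frags):
                    node.store_local_fragment(manifest, frags)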

11 Glacier Recovery – Glacier does not need to detect failures explicitly Compromised nodes – Either fail permanently, in which case other nodes take over their key segments, or are repaired and rejoin the system with an empty fragment store – The number of simultaneous fragment reconstructions is limited to a fixed value to avoid congestive collapse Garbage collection – Happens when a lease expires – Can be carried out independently by each storage node (sketch below) – A grace period TG allows for the maximal clock difference between nodes
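
A sketch of the per-node garbage collector; the data layout is assumed, while the role of the grace period TG follows the slide:

    import time

    def collect_garbage(fragment_store: dict, t_grace: float):
        # Each node collects independently: a fragment is deleted only once
        # its lease has been expired for longer than the grace period T_G,
        # so a node with a slightly slower clock cannot still consider the
        # lease valid.
        now = time.time()
        for key in list(fragment_store):
            if now > fragment_store[key]['lease_expiry'] + t_grace:
                del fragment_store[key]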

12 Glacier Configuration – An object can be reconstructed if r out of its n fragments can be obtained – n and r are chosen so that the survival probability P meets the desired durability (a simplified model is sketched below) – Glacier still offers some protection even when fmax is chosen too low – The lease time must be longer than the maximal duration of a large-scale failure – on the order of months
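
A simplified durability model consistent with these parameters: if at most a fraction f of the nodes fail and fragment losses are treated as independent, an object survives when at least r of its n fragments remain. This is a sketch, not necessarily the paper's exact derivation:

    from math import comb

    def survival_probability(n: int, r: int, f: float) -> float:
        # Probability that at least r of n fragments survive when each
        # fragment independently survives with probability 1 - f:
        # sum over i = r..n of C(n, i) * (1-f)^i * f^(n-i).
        return sum(comb(n, i) * (1 - f) ** i * f ** (n - i)
                   for i in range(r, n + 1))

    # With the ePOST settings used later (n=48, r=5, fmax=0.6), this
    # comfortably exceeds the configured Pmin = 0.999999.
    print(survival_probability(48, 5, 0.60))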

13 Glacier – Object Aggregation Massive redundancy creates a substantially larger number of internal objects than application objects Small application objects are aggregated to reduce the cost of fragment creation and maintenance – an aggregate holds tuples (oi, ki, vi) (sketch below) Aggregation is performed on a per-user basis – Simple, but forgoes the opportunity to bundle objects from different users Local aggregate directory – Aggregates link to one another, forming a directed acyclic graph
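
A sketch of per-user aggregation: small objects are packed into one aggregate so that Glacier creates and maintains one set of fragments per aggregate rather than per object. The size threshold and object fields are assumptions:

    def build_aggregate(pending: list, max_size: int) -> list:
        # pending: one user's small application objects, each a dict with
        # 'data' (o_i), 'key' (k_i), 'version' (v_i), and 'size' in bytes.
        aggregate, total = [], 0
        while pending and total + pending[0]['size'] <= max_size:
            obj = pending.pop(0)
            aggregate.append((obj['data'], obj['key'], obj['version']))
            total += obj['size']
        return aggregate  # stored in Glacier as a single object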

14 Glacier – Object Aggregation Recovery – Both the primary-store data and the aggregate directory can be lost after a correlated failure – The aggregate directory is recovered by walking the DAG Consolidation – Periodically check the aggregate directory for aggregates whose leases will expire soon – The lease is not renewed if many objects in the aggregate have expired leases; the non-expired objects are consolidated with new objects into a new aggregate – Particularly effective when object lifetimes are bimodal: the consolidated aggregate then contains mostly long-lived objects

15 Glacier – Security Potential attacks target either the durability or the integrity of the data stored in Glacier – Attacks on integrity – Attacks on durability – Attacks on the time source – Space-filling attacks – Attacks on Glacier itself – Haystack-needle attacks

16 Evaluation Glacier prototype – Built on top of FreePastry, an implementation of the Pastry structured overlay – Uses PAST as its primary store Two sets of experiments – ePOST: a cooperative, serverless email system for a small group of users, with Glacier as its storage layer – Trace-driven simulations: a much larger workload with 147 users and up to 1,000 nodes

17 Evaluation ePOST experiments – 20 to 30 nodes, mostly desktop PCs running Linux – 8 passive users and 9 active users – Glacier stores email and the corresponding metadata with n=48, r=5, fmax=60%, Pmin=0.999999 – The experiment is too small to guarantee uncorrelated fragment losses – Glacier handled all the failures that occurred during the development and testing of ePOST

18 Evaluation ePOST workload – Cumulative size of inserted objects over time (live – objects whose leases have not yet expired) – Histogram of object sizes is bimodal: a large number of objects between 1-10 KB, which justifies aggregation, and a small number of large objects, usually attachments

19 Evaluation ePOST storage – The amount of storage required by Glacier for the workload grows slowly as new emails enter the system – The on-disk XML data structures create an additional 32% overhead over the actual email payload – Total storage overhead is therefore close to 9.6 × 1.32 ≈ 12.7, where 9.6 is the erasure-coding expansion n/r = 48/5

20 Evaluation ePOST traffic – Five categories: insertion, refresh, maintenance, handoff, and lookup – During periods with few failures, traffic is dominated by insertions and refreshes – During unstable periods, handoff and maintenance traffic increases

21 Evaluation ePOST aggregation – Comparing the number of objects with the number of aggregates in the system shows that aggregation reduces the number of keys by over an order of magnitude – The low number of expired objects indicates that aggregate consolidation is effective

22 Evaluation Simulation study – diurnal behavior – Glacier and the aggregation layer were implemented in the simulator – Trace taken from a departmental email server – Diurnal behavior affects the message overhead: during periods of higher churn there is less insertion traffic but more maintenance messages to recover lost fragments

23 Evaluation Simulation study – load – How load influences the message overhead: under light load, the message overhead remains roughly constant, since aggregation is performed periodically by every node; under higher load, the overhead increases roughly linearly

24 Evaluation Simulation study – scalability – Increasing the overlay size while measuring the per-node traffic overhead: it remains approximately constant, growing only slowly because messages are routed through Pastry, whose routes take O(log N) hops

25 Conclusions Glacier ensures the durability of unrecoverable data in a cooperative, decentralized storage system – Robust to large-scale, correlated, Byzantine failures of storage nodes – No introspection – Massive redundancy masks the effects of correlated failures – Erasure codes and garbage collection reduce the storage cost – Aggregation and a fragment maintenance protocol reduce the message costs

