
Slide 1: Failure Independence in the OceanStore Archive
Hakim Weatherspoon, University of California, Berkeley

Slide 2: Questions About Information
Where is persistent information stored?
– Want: geographic independence for availability, durability, and the freedom to adapt to circumstances.

Slide 3: m-of-n Encoding
[Figure: a data object is encoded into redundant fragments; some fragments are received, some are not.]
Redundancy without the overhead of replication:
– Divide the object into m fragments, then recode them into n fragments: a rate r = m/n code.
– Storage increases by a factor of 1/r.
– Key property: the object can be reconstructed from any m fragments.
– E.g. r = 1/4, m = 16, n = 64 fragments: storage increases by a factor of four.
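
To make the any-m-of-n property concrete, here is a minimal Python sketch of a Reed-Solomon-style code over a prime field. This is a toy, not the archive's actual coder: the field choice, symbol packing, and systematic layout are assumptions for illustration only.

    P = 2**31 - 1   # prime modulus; all arithmetic is in GF(P)

    def _interp(points, x):
        # Lagrange interpolation at x through the given (xi, yi) points, mod P
        total = 0
        for xi, yi in points:
            num, den = 1, 1
            for xj, _ in points:
                if xj != xi:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, P - 2, P)) % P
        return total

    def encode(data, n):
        # data gives the values of a degree-(m-1) polynomial at x = 0..m-1;
        # fragment i is (i, poly(i)), so the first m fragments are the data itself
        m = len(data)
        pts = list(enumerate(data))
        return [(x, data[x] % P if x < m else _interp(pts, x)) for x in range(n)]

    def decode(fragments, m):
        # reconstruct the data from ANY m of the n fragments
        return [_interp(fragments[:m], x) for x in range(m)]

    data = [104, 105, 33, 7]        # m = 4 symbols
    frags = encode(data, n=16)      # rate r = 4/16 = 1/4: storage grows 4x
    assert decode(frags[5:9], m=4) == data   # any 4 of the 16 suffice

With m = 4 and n = 16 this is the slide's rate r = 1/4 scaled down: storage grows by 1/r = 4, and any m fragments reconstruct the object.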

Slide 4: Assumptions
– OceanStore is modeled as a collection of independently failing disks.
– Failed disks are replaced by new, blank ones.
– For a given block, each fragment is placed on a unique, randomly selected disk.
– A repair epoch is the time period between global sweeps, in which a repair process scans the system and attempts to restore lost redundancy.

Slide 5: Availability
Exploit statistical stability from a large number of components. E.g. given a million machines, each 90% available:
– 2 replicas yield 2 nines of availability.
– 16 fragments yield 5 nines of availability.
– 32 fragments yield 8 nines of availability.
"More than six 9's of availability requires world peace." – Steve Gribble, 2001.
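
A sketch of the binomial arithmetic behind these figures. The slide does not state m for the fragment cases, so a rate-1/2 code (m = n/2 fragments needed for reconstruction) is assumed here; with that assumption the numbers come out as claimed (the slide rounds nines down).

    from math import comb, log10

    def unavail(n, m, p=0.9):
        # P(fewer than m of the n independently-available fragments are reachable)
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m))

    print(-log10(unavail(2, 1)))    # 2 replicas (need any 1):  2.0 nines
    print(-log10(unavail(16, 8)))   # 16 fragments, m = 8:     ~5.2 nines
    print(-log10(unavail(32, 16)))  # 32 fragments, m = 16:    ~8.9 nines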

Slide 6: Durability
E.g. MTTF_block = 10^35 years for a particular block:
– n = 64, r = 1/4, and repair epoch e = 6 months.
– Replication at the same storage cost and repair epoch gives MTTF_block = 35 years!
– Matching MTTF_block = 10^35 years with replication would require 36 replicas.
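
One simple way to model this, sketched below: a block dies in an epoch iff more than n − m of its fragments are lost before the repair sweep, so MTTF ≈ epoch / P(death per epoch). The per-epoch disk-failure probability q and the independence model itself are assumptions, so the output illustrates the shape of the slide's claim rather than reproducing 10^35 exactly.

    from math import comb

    def mttf_years(n, m, q, epoch_years):
        # block dies in an epoch iff more than n - m fragments fail before repair
        p_death = sum(comb(n, k) * q**k * (1 - q)**(n - k)
                      for k in range(n - m + 1, n + 1))
        return epoch_years / p_death

    # n = 64, m = 16 (r = 1/4), 6-month epoch; q = 5% per-epoch disk
    # failure is an assumed figure, not one from the slide
    print(f"{mttf_years(64, 16, 0.05, 0.5):.2g} years")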

Slide 7: Erasure Coding vs. Replication
Fix storage overhead and repair epoch:
– MTTF for erasure codes is orders of magnitude higher.
Fix MTTF_system and repair epoch:
– Storage, bandwidth, and disk seeks for erasure codes are an order of magnitude lower.
– Storage_replica / Storage_erasure = R * r
– BW_replica / BW_erasure = R * r
– DiskSeeks_replica / DiskSeeks_erasure = R / n, or = R * r with a smart storage server.
E.g. 2^16 users at 35 MB/hr/user gives 10^17 blocks; want MTTF_system = 10^20 years.
– R = 22 replicas, or r = m/n = 32/64; repair epoch = 4 months.
– Storage_replica / Storage_erasure = 11
– BW_replica / BW_erasure = 11
– DiskSeeks_replica / DiskSeeks_erasure = 11 best case, or 0.29 worst case.
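
Plugging the slide's example values into its own formulas. (Note the worst-case seek ratio computes to R/n = 22/64 ≈ 0.34 rather than the slide's 0.29, presumably due to parameters not shown; everything else matches.)

    # R replicas vs. a rate r = m/n erasure code, per the slide's formulas
    R, m, n = 22, 32, 64
    r = m / n
    print("Storage_replica / Storage_erasure :", R * r)   # 11.0
    print("BW_replica / BW_erasure           :", R * r)   # 11.0
    print("Seeks, smart server (R * r)       :", R * r)   # 11.0 best case
    print("Seeks, worst case (R / n)         :", R / n)   # ~0.34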

Slide 8: Requirements
Can this be real? Three requirements must be met:
– Failure independence.
– Data integrity.
– Efficient repair.

Slide 9: Failure Independence Model
– Model Builder: builds a model of failure correlation from various sources (introspection, human input, network monitoring).
– Set Creator: queries random nodes to produce dissemination sets: sets of storage servers that fail with low correlation.
– Disseminator: sends fragments to the members of a set.
[Figure: the Model Builder passes its model to the Set Creator, which probes storage servers by type and supplies dissemination sets to the Disseminator, which sends fragments to the storage servers.]

Slide 10: Model Builder I
– Models the correlation of failure among types of storage servers; a type is an enumeration of server properties.
– Collects availability statistics on storage-server types and computes marginal and pairwise joint probabilities of failure (a sketch of this estimation follows).
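
A hedged sketch of that estimation step. The log format (one up/down boolean per server per observation interval) and the site labels are assumptions for illustration, not from the slide.

    def failure_stats(logs):
        # logs: {server: [True if up, False if down, ...]}, one entry per interval
        T = len(next(iter(logs.values())))
        marginal = {s: sum(not up for up in obs) / T for s, obs in logs.items()}
        joint = {}
        for a in logs:
            for b in logs:
                if a < b:
                    joint[(a, b)] = sum((not x) and (not y)
                                        for x, y in zip(logs[a], logs[b])) / T
        return marginal, joint

    obs = {"verio-ca":     [1, 1, 0, 1, 1, 0],
           "rackspace-tx": [1, 0, 0, 1, 1, 1]}
    marg, joint = failure_stats({s: list(map(bool, v)) for s, v in obs.items()})
    print(marg)   # marginal failure probabilities, per server
    print(joint)  # pairwise joint failure probabilities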

Slide 11: Model Builder II
Mutual information computes the pairwise correlation of two nodes' failures:
– I(X;Y) = H(X) − H(X|Y)
– I(X;Y) = H(X) + H(Y) − H(X,Y)
where I(X;Y) is the mutual information and H(X) is the entropy of X.
– Entropy is a measure of randomness.
– I(X;Y) is the reduction in the entropy of X obtained by observing Y.
– E.g. X, Y ∈ {up, down}.
[Figure: Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y).]
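
A small sketch of the identity I(X;Y) = H(X) + H(Y) − H(X,Y) for binary up/down states. The example joint distributions are invented for illustration.

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_information(joint):
        # joint[(x, y)] = P(X=x, Y=y) for x, y in {'up', 'down'}
        px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in ('up', 'down')}
        py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ('up', 'down')}
        return H(px.values()) + H(py.values()) - H(joint.values())

    # Two independent 90%-available servers: no information shared.
    indep = {(x, y): px * py
             for x, px in (('up', 0.9), ('down', 0.1))
             for y, py in (('up', 0.9), ('down', 0.1))}
    print(mutual_information(indep))          # ~0.0 bits: low correlation

    # Servers whose failures tend to coincide: positive mutual information.
    corr = {('up', 'up'): 0.88, ('up', 'down'): 0.02,
            ('down', 'up'): 0.02, ('down', 'down'): 0.08}
    print(mutual_information(corr))           # ~0.26 bits: correlated failures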

Slide 12: Model Builder III
– Implemented mutual information in Matlab; used the fault load from [1] to compute mutual information, reflecting the rate of failures of the underlying network.
– Measured interconnectivity among eight sites: Verio (CA), Rackspace (TX), Rackspace (U.K.), University of California, Berkeley, University of California, San Diego, University of Utah, University of Texas at Austin, and Duke.
– The test ran for six days; day one had the highest failure rate, 0.17%.
Results:
– Different levels of service availability: some site fault loads had the same average failure rates but differed in the timing and nature of failures.
– Sites fail with low correlation according to mutual information.
[1] Yu and Vahdat. The Costs and Limits of Availability for Replicated Services. SOSP 2001.

Slide 13: Set Creator I
Node discovery:
– Collects information on a large set of storage servers.
– Uses properties of Tapestry: scans the node address space, which is sparse with random node names.
– Willing servers respond with a signed statement of their type.
Set creation:
– Clusters servers that fail with high correlation.
– Creates a dissemination set from servers that fail with low correlation (one possible selection rule is sketched below).
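
The slide does not say how the set is chosen, so this is a hedged sketch of one plausible rule: greedily add the candidate whose worst pairwise failure correlation with the servers already chosen is smallest. The greedy rule, the arbitrary seed, and the corr function are all assumptions.

    def dissemination_set(candidates, corr, size):
        # corr(a, b): pairwise failure correlation, e.g. mutual information
        chosen = [candidates[0]]                 # seed arbitrarily
        while len(chosen) < size:
            best = min((c for c in candidates if c not in chosen),
                       key=lambda c: max(corr(c, s) for s in chosen))
            chosen.append(best)
        return chosen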

Slide 14: Disseminator I
Disseminator server architecture:
– Requests to archive objects are received through the network layers.
– The consistency mechanism decides when to archive an object.
[Figure: server architecture; a Java Virtual Machine with a thread scheduler hosts dispatch, consistency, location & routing, disseminator, and introspection modules atop asynchronous disk and network layers in the operating system.]

Slide 15: SEDA I
Staged Event-Driven Architecture (SEDA) as the server, by Matt Welsh.
– High concurrency: similar to traditional event-driven design.
– Load conditioning: drop, filter, or reorder events.
– Code modularity: stages are developed and maintained independently.
– Scalable I/O: asynchronous network and disk I/O.
– Debugging support: queue-length profiler, event-flow trace.

Slide 16: SEDA II
A stage is a self-contained application component consisting of:
– an event handler,
– an incoming event queue,
– a thread pool.
Each stage is managed by a controller that affects scheduling and resource allocation.
Operation (sketched below):
– A thread pulls an event off the stage's incoming event queue and invokes the supplied event handler.
– The event handler processes each task and dispatches zero or more tasks by enqueuing them on the event queues of other stages.
[Figure: a stage's controller oversees the event queue, thread pool, and event handler, which emits outgoing events.]
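
A minimal Python sketch of the stage shape just described: an incoming event queue, a thread pool pulling events, and a handler that can dispatch to other stages. SEDA itself is a Java framework with per-stage controllers; this toy only mirrors its structure, and the two example stages are invented.

    import queue, threading

    class Stage:
        def __init__(self, name, handler, threads=2):
            self.name = name
            self.events = queue.Queue()        # incoming event queue
            self.handler = handler             # handler(event, dispatch)
            for _ in range(threads):           # the stage's thread pool
                threading.Thread(target=self._run, daemon=True).start()

        def enqueue(self, event):
            self.events.put(event)

        def _run(self):
            while True:
                event = self.events.get()      # a thread pulls an event
                # the handler may dispatch zero or more tasks to other stages
                self.handler(event, lambda stage, ev: stage.enqueue(ev))
                self.events.task_done()

    # Two toy stages: one "generates" fragments, the next "disseminates" them.
    disseminate = Stage("disseminate", lambda ev, dispatch: print("send", ev))
    generate = Stage("generate",
                     lambda ev, dispatch: [dispatch(disseminate, (ev, i))
                                           for i in range(4)])
    generate.enqueue("block-42")
    generate.events.join()
    disseminate.events.join()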

Slide 17: Disseminator Control Flow
[Figure: event flow among the Consistency stage(s), GenerateChkpt stage, GenerateFrags stage, Disseminator stage, and Cache stage. Messages include Req, GenerateFragsReq/GenerateFragsResp, GenerateFragsChkptReq/GenerateFragsChkptResp, DisseminateFragsReq/DisseminateFragsResp, BlkReq/BlkResp, and SpaceReq/SpaceResp; the Disseminator stage sends fragments to the storage servers.]

Slide 18: Storage Control Flow
[Figure: a StorageReq arrives at the Storage stage, which issues a StoreReq to the Cache stage and a BufferdStoreReq to the Disk stage, then sends a MAC'd acknowledgment.]

Slide 19: Performance I
Focus:
– Performance of an OceanStore server in archiving an object.
– Analysis is restricted to the operations of archiving or recoalescing an object; the network phases of disseminating or requesting fragments are not analyzed.
Performance of the archival layer:
– The OceanStore server was analyzed on a single-processor, 850 MHz Pentium III machine with 768 MB of memory, running the Debian distribution of the Linux 2.4.1 kernel.
– Used BerkeleyDB for reading and writing blocks to disk.
– Simulated a number of independent event streams (reads or writes).

Slide 20: Performance II
Throughput:
– Users created traffic without delay.
– The archive sustains ~30 requests/sec, and throughput remains constant.
Turnaround time:
– Response time, i.e. user-perceived latency.
– Increases linearly with the number of events.

Slide 21: Discussion
– Distribute the creation of models.
– Byzantine commit on dissemination sets.
– Cache: "old hat" (LRU, second-chance algorithm, free list, multiple databases, etc.), but critical to the performance of the server.

Slide 22: Issues
– Number of disk heads needed.
– Are erasure codes good for streaming media? (Caching layer.)
– Delete:
  – Eradication is antithetical to durability!
  – If you can eradicate something, then so can someone else (denial of service).
  – Must have an "eradication certificate" or similar.

