Failure Independence in OceanStore Archive
Hakim Weatherspoon
University of California, Berkeley

Slide 2: Questions About Information
Where is persistent information stored?
–Want: geographic independence for availability, durability, and the freedom to adapt to circumstances.

Slide 3: m of n Encoding
[Figure: a data object encoded into redundant fragments; the legend distinguishes fragments received from fragments not received.]
Redundancy without the overhead of replication.
Divide the object into m fragments; recode into n fragments.
A rate r = m/n code increases storage by 1/r.
Key: reconstruct from any m fragments.
E.g.:
–r = 1/4, m = 16, n = 64 fragments.
–Increases storage by a factor of four.
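As a quick check of the arithmetic above, a minimal sketch (in Python, with parameter names of my choosing) of the rate and storage blow-up of an m-of-n code:

```python
# Rate and storage overhead of an m-of-n erasure code (names are illustrative).
def rate(m: int, n: int) -> float:
    return m / n            # r = m/n

def storage_blowup(m: int, n: int) -> float:
    return 1 / rate(m, n)   # storage increases by 1/r

m, n = 16, 64
print(rate(m, n))            # 0.25  (r = 1/4)
print(storage_blowup(m, n))  # 4.0   (four times the storage of the raw object)
```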

Slide 4: Assumptions
OceanStore: a collection of independently failing disks.
Failed disks are replaced by new, blank ones.
Each fragment is placed on a unique, randomly selected disk.
–For a given block.
A repair epoch:
–the time between global sweeps, in which a repair process scans the system and attempts to restore redundancy.

Slide 5: Availability
Exploit statistical stability from a large number of components.
E.g., given a million machines with 90% availability each:
–2 replicas yield 2 9's of availability.
–16 fragments yield 5 9's of availability.
–32 fragments yield 8 9's of availability.
"More than 6 9's of availability requires world peace."
–Steve Gribble, 2001.
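The nines quoted above can be roughly reproduced with a simple binomial model, assuming independent server failures and, for the fragment cases, a rate-1/2 code (m = n/2); the slide does not state m, so that choice is an assumption of this sketch:

```python
from math import comb, log10

def block_availability(m: int, n: int, p: float) -> float:
    """P(at least m of n fragments are reachable), with each fragment on an
    independently available server that is up with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def nines(a: float) -> float:
    return -log10(1 - a)

p = 0.9  # 90% of machines available
print(round(nines(block_availability(1, 2, p)), 1))    # 2 replicas          -> ~2 nines
print(round(nines(block_availability(8, 16, p)), 1))   # 16 fragments, m = 8 -> ~5 nines
print(round(nines(block_availability(16, 32, p)), 1))  # 32 fragments, m = 16 -> ~8 nines
```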

Slide 6: Durability
E.g., MTTF_block = … years for a particular block:
–n = 64, r = 1/4, and repair epoch e = 6 months.
–MTTF_block = 35 years for replication at the same storage cost and repair epoch!
Need 36 replicas for MTTF_block = … years for a particular block.
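One back-of-the-envelope way to see why erasure coding buys so much durability at the same storage cost: a block is lost only if more fragments than the code can tolerate fail within a single repair epoch, and MTTF is roughly the epoch length divided by that per-epoch loss probability. The sketch below is an illustrative simplification, not the model behind the slide's figures, and the per-epoch fragment-failure probability is invented:

```python
from math import comb

def p_loss_per_epoch(m: int, n: int, p_fail: float) -> float:
    """P(more than n - m fragments are lost within one repair epoch),
    i.e. fewer than m survive until the next repair pass."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(n - m + 1, n + 1))

def mttf_years(m: int, n: int, p_fail: float, epoch_years: float) -> float:
    """Geometric approximation: expected epochs until loss, times epoch length."""
    return epoch_years / p_loss_per_epoch(m, n, p_fail)

p_fail = 0.05   # assumed per-epoch disk failure probability (not from the slide)
epoch = 0.5     # 6-month repair epoch

print(f"{mttf_years(16, 64, p_fail, epoch):.3g}")  # rate-1/4 code: astronomically large
print(f"{mttf_years(1, 4, p_fail, epoch):.3g}")    # 4 replicas (same 4x storage): far smaller
```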

Slide 7: Erasure Coding vs. Replication
Fix storage overhead and repair epoch:
–MTTF for erasure codes is orders of magnitude higher.
Fix system MTTF and repair epoch:
–Storage, bandwidth, and disk seeks for erasure codes are an order of magnitude lower.
–Storage_replica / Storage_erasure = R * r
–BW_replica / BW_erasure = R * r
–DiskSeeks_replica / DiskSeeks_erasure = R / n, or R * r with a smart storage server.
E.g.:
–2^16 users at 35 MB/hr/user, … blocks; want MTTF_system = … years.
–R = 22 replicas or r = m/n = 32/64, repair epoch = 4 months.
–Storage_replica / Storage_erasure = 11
–BW_replica / BW_erasure = 11
–DiskSeeks_replica / DiskSeeks_erasure = 11 best case, or 0.29 worst case.
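Plugging the slide's example parameters back into its own ratio formulas, as a sanity check (the quoted 0.29 worst-case seek ratio presumably reflects details beyond the bare R/n formula):

```python
# Replication vs. erasure-coding cost ratios from the slide's formulas.
R = 22          # number of replicas
m, n = 32, 64   # erasure-code parameters
r = m / n       # code rate

print(R * r)    # Storage_replica / Storage_erasure = 11.0
print(R * r)    # BW_replica / BW_erasure = 11.0 (also the best-case seek ratio)
print(R / n)    # DiskSeeks_replica / DiskSeeks_erasure, bare formula ~= 0.34
```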

Slide 8: Requirements
Can this be real?
–Three requirements must be met: failure independence, data integrity, and efficient repair.

Slide 9: Failure Independence Model
Model Builder:
–Uses various sources.
–Models failure correlation.
Set Creator:
–Queries random nodes.
–Builds dissemination sets: storage servers that fail with low correlation.
Disseminator:
–Sends fragments to the members of a set.
[Figure: the Model Builder takes introspection, human input, and network monitoring as inputs and produces a model for the Set Creator; the Set Creator probes storage servers by type and passes a dissemination set to the Disseminator, which sends fragments to the storage servers.]

Slide 10: Model Builder I
Models the correlation of failure among types of storage servers.
–A type is an enumeration of server properties.
Collects availability statistics on storage server types (see the sketch below).
–Computes marginal and pairwise joint probabilities of failure.
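A minimal sketch of this statistics gathering, under the assumption that availability data arrives as per-interval up/down observations for each server type (the traces and names are invented):

```python
from itertools import combinations

# Toy availability traces: one up(1)/down(0) observation per monitoring interval.
traces = {
    "verio-ca":  [1, 1, 1, 0, 1, 1, 1, 1],
    "rackspace": [1, 1, 0, 0, 1, 1, 1, 1],
    "berkeley":  [1, 0, 1, 1, 1, 1, 0, 1],
}

def marginal_failure(trace):
    """Marginal probability that a server type is down in an interval."""
    return trace.count(0) / len(trace)

def joint_failure(a, b):
    """Pairwise joint probability that both types are down in the same interval."""
    both_down = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return both_down / len(a)

for name, trace in traces.items():
    print(name, marginal_failure(trace))
for (na, ta), (nb, tb) in combinations(traces.items(), 2):
    print(na, nb, joint_failure(ta, tb))
```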

Slide 11: Model Builder II
Mutual Information:
–Computes the pairwise correlation of two nodes' failures.
–I(X,Y) = H(X) – H(X|Y)
–I(X,Y) = H(X) + H(Y) – H(X,Y)
I(X,Y) is the mutual information; H(X) is the entropy of X.
–Entropy is a measure of randomness.
–I(X,Y) is the reduction in the entropy of Y given X.
–E.g., X, Y ∈ {up, down}.
[Figure: Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X,Y).]
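Those statistics plug directly into the formulas above. A sketch computing H(X), H(Y), H(X,Y), and I(X,Y) from a joint distribution over {up, down} (the joint probabilities below are invented for illustration):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Joint distribution of two nodes' states over {up, down} (illustrative numbers).
joint = {("up", "up"): 0.85, ("up", "down"): 0.05,
         ("down", "up"): 0.05, ("down", "down"): 0.05}
print(mutual_information(joint))  # ~0.09 bits; lower means closer to independent failures
```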

Slide 12: Model Builder III
Implemented mutual information in Matlab.
–Used the fault load from [1] to compute mutual information.
–Reflects the rate of failures of the underlying network.
Measured the interconnectivity among eight sites:
–Verio (CA), Rackspace (TX), Rackspace (U.K.), University of California, Berkeley, University of California, San Diego, University of Utah, University of Texas at Austin, and Duke.
–The test ran for six days; day one had the highest failure rate, 0.17%.
Results:
–Different levels of service availability.
–Some site fault loads had the same average failure rates but differed in the timing and nature of failures.
–Sites fail with low correlation according to mutual information.
[1] Yu and Vahdat. The Costs and Limits of Availability for Replicated Services. SOSP 2001.

Slide 13: Set Creator I
Node Discovery:
–Collects information on a large set of storage servers.
–Uses properties of Tapestry: scans the node address space, which is sparse, with random node names.
–Willing servers respond with a signed statement of their type.
Set Creation:
–Clusters servers that fail with high correlation.
–Creates a dissemination set: servers that fail with low correlation (a sketch of one such selection follows below).
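A hedged illustration of set creation: given pairwise mutual-information scores, greedily grow a dissemination set by always adding the candidate whose worst correlation with the servers already chosen is smallest. The greedy rule, server names, and scores are stand-ins, not the actual Set Creator algorithm:

```python
# Greedy dissemination-set selection from pairwise mutual information (lower = less correlated).
mi = {
    ("verio-ca", "rackspace-tx"): 0.02, ("verio-ca", "rackspace-uk"): 0.01,
    ("verio-ca", "berkeley"): 0.05,     ("rackspace-tx", "rackspace-uk"): 0.30,
    ("rackspace-tx", "berkeley"): 0.03, ("rackspace-uk", "berkeley"): 0.02,
}
servers = ["verio-ca", "rackspace-tx", "rackspace-uk", "berkeley"]

def correlation(a, b):
    return mi.get((a, b), mi.get((b, a), 0.0))

def dissemination_set(candidates, size):
    chosen = [candidates[0]]
    remaining = set(candidates[1:])
    while len(chosen) < size and remaining:
        # Pick the candidate whose worst correlation with the chosen set is smallest.
        best = min(remaining, key=lambda s: max(correlation(s, c) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

print(dissemination_set(servers, 3))  # avoids pairing the two highly correlated Rackspace sites
```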

Slide 14: Disseminator I
Disseminator Server Architecture:
–Requests to archive objects are received through the network layers.
–The consistency mechanisms decide to archive an object.
[Figure: server architecture layering asynchronous disk and network I/O over the operating system and Java Virtual Machine, with a thread scheduler and dispatch layer feeding the Consistency, Location & Routing, Disseminator, and Introspection modules.]

Slide 15: SEDA I
Staged Event-Driven Architecture (SEDA) as the server design.
–by Matt Welsh.
High concurrency:
–Similar to traditional event-driven design.
Load conditioning:
–Drop, filter, or reorder events.
Code modularity:
–Stages are developed and maintained independently.
Scalable I/O:
–Asynchronous network and disk I/O.
Debugging support:
–Queue length profiler.
–Event flow trace.

Slide 16: SEDA II
Stage:
–a self-contained application component consisting of an event handler, an incoming event queue, and a thread pool.
–Each stage is managed by a controller that affects scheduling and resource allocation.
Operation (see the sketch below):
–A thread pulls an event off of the stage's incoming event queue.
–It invokes the supplied event handler.
–The event handler processes each task and dispatches zero or more tasks by enqueuing them on the event queues of other stages.
[Figure: a stage's event queue feeding a thread pool and event handler under a controller, with outgoing events sent to other stages.]
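A minimal Python sketch of that operation loop (class and stage names here are hypothetical, not OceanStore's Java classes): each stage owns an event queue and a small thread pool, and its handler returns follow-on events to enqueue on other stages.

```python
import queue
import threading
import time

class Stage:
    """A SEDA-style stage: incoming event queue + thread pool + event handler."""
    def __init__(self, name, handler, num_threads=2):
        self.name = name
        self.handler = handler        # handler(event) -> list of (stage, event) to dispatch
        self.events = queue.Queue()   # incoming event queue
        for _ in range(num_threads):
            threading.Thread(target=self._loop, daemon=True).start()

    def enqueue(self, event):
        self.events.put(event)

    def _loop(self):
        while True:
            event = self.events.get()                # a thread pulls an event off the queue
            for stage, out in self.handler(event):   # handler dispatches zero or more events
                stage.enqueue(out)

# Two toy stages wired together: fragment generation feeding dissemination.
def disseminate(frag):
    print("sending", frag, "to a storage server")
    return []

def generate_frags(block):
    return [(disseminate_stage, f"{block}-frag{i}") for i in range(4)]

disseminate_stage = Stage("Disseminate", disseminate)
generate_stage = Stage("GenerateFrags", generate_frags)

generate_stage.enqueue("blockA")
time.sleep(0.5)   # give the daemon worker threads a moment to drain the queues
```

A production SEDA server additionally attaches a controller to each stage to resize thread pools and shed load; that is omitted from this sketch.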

Slide 17: Disseminator Control Flow
[Figure: control flow among the Consistency Stage(s), GenerateChkpt Stage, GenerateFrags Stage, Disseminator Stage, and Cache Stage, via the events GenerateFragsChkptReq/Resp, GenerateFragsReq/Resp, DisseminateFragsReq/Resp, BlkReq/BlkResp, and SpaceReq/SpaceResp; the Disseminator Stage sends fragments to the storage servers.]

Slide 18: Storage Control Flow
[Figure: storage-side control flow among the Storage Stage, Cache Stage, and Disk Stage, via StorageReq, StoreReq, and BufferdStoreReq events; the server sends back a MAC'd acknowledgment.]

Slide 19: Performance I
Focus:
–Performance of an OceanStore server in archiving an object.
–Restrict analysis to only the operations of archiving or recoalescing an object; do not analyze the network phases of disseminating or requesting fragments.
Performance of the archival layer:
–The OceanStore server was analyzed on a single-processor, 850 MHz Pentium III machine with 768 MB of memory, running the Debian distribution of the Linux kernel.
–Used BerkeleyDB for reading and writing blocks to disk.
–Simulated a number of independent event streams (read or write).

Slide 20: Performance II
Throughput:
–Users created traffic without delay.
–The archive handles ~30 req/sec.
–Throughput remains constant.
Turnaround time:
–Response time, i.e. user-perceived latency.
–Increases linearly with the number of events.

Slide 21: Discussion
Distribute the creation of models.
Byzantine commit on dissemination sets.
Cache:
–"Old hat": LRU, second-chance algorithm, free list, multiple databases, etc.
–Critical to the performance of the server.

Slide 22: Issues
Number of disk heads needed.
Are erasure codes good for streaming media?
–Caching layer.
Delete:
–Eradication is antithetical to durability!
–If you can eradicate something, then so can someone else (denial of service)!
–Must have an "eradication certificate" or similar.