Making the Archive Real

Making the Archive Real
Hakim Weatherspoon, University of California, Berkeley
ROC/OceanStore Retreat, Lake Tahoe. Tuesday, June 11, 2002

My name is HW, and this is joint work with CW and JK at UC Berkeley. I am going to talk primarily about the OceanStore archival layer, except where it is appropriate to show its relationship to other components, like the inner ring or secondary replicas.

Archival Storage
Every version of every object is archived, at least in principle.
How long does it take to write? To read?
Can the archive be run inline, or is it best to batch?

This talk is really a war story about how we fight for that principle. In OceanStore, every version of every object is archived (at least in principle), which means the archival process has to add essentially free, or at least minimal, overhead over the base update process. Specifically, I am going to talk about how long it takes to write to and read from the archive, where the bottlenecks are, and where we can improve performance. In standing by our principle, can we archive every update inline, or is it best to batch archival requests, and what are the tradeoffs?

Outline
Archival process context.
Archival modes.
Archival performance.
Future directions.

Path of an Update
As an example, imagine a very popular file, such as an MP3, that has been cached and replicated throughout the infrastructure. The purple bubbles represent the cached and replicated file, and the orange bubbles represent a Byzantine ring responsible for committing changes to the file. Suppose the author wants to add a new verse and modify the file. She submits her changes to the infrastructure; they go to an inner ring that performs a Byzantine agreement on the changes and multicasts them down to each replica. This situation should be of concern: each replica can respond to a request for the file, so you need some way to verify the results returned, and each replica can come and go as it pleases, so where is the persistent storage?
Two requirements: verify the data; provide persistent storage.

Archival Dissemination Built into the Update
Erasure codes: redundancy without the overhead of strict replication.
Produce n fragments, where any m are sufficient to reconstruct the data, with m < n.
Rate r = m/n; storage overhead is 1/r.
Assume a repair process.

Erasure codes are a form of redundancy without the overhead of strict replication. They produce n fragments, any m of which are sufficient to reconstruct the data, where m is strictly less than n. We call m/n the rate of encoding; the observation is that storage grows only by the inverse of the rate. For example, m = 16 and n = 64 increase storage by a factor of 4. We have pictured building this redundancy step into the update process: the modification to the MP3 file is submitted to the network, a Byzantine agreement is performed, and fragments from the erasure coding are disseminated right away. Note that the fragments are much smaller than the actual data. To keep the system durable, you could imagine a repair process that periodically replaces lost fragments. Now suppose you want durability at the storage cost of four replicas: how can you structure the redundancy within that constraint?
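As a quick illustration of the arithmetic on this slide (a sketch only, not the prototype's encoder), the rate determines the storage overhead directly:

    # Sketch of the storage-cost arithmetic only, not an actual erasure encoder.
    def storage_overhead(m: int, n: int) -> float:
        """Any m of the n fragments can rebuild the block, so the rate is
        r = m/n and total storage grows by 1/r relative to the raw data."""
        assert 0 < m < n
        return n / m  # = 1/r

    # Example from the talk: m = 16, n = 64 gives rate 1/4, i.e. 4x storage.
    print(storage_overhead(16, 64))  # 4.0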

Durability
Fraction of Blocks Lost Per Year (FBLPY)*: r = 1/4, erasure-encoded block (e.g. m = 16, n = 64).
Increasing the number of fragments increases the durability of a block, at the same storage cost and repair time.
The n = 4 fragment case is equivalent to replication on four servers.
* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, Proc. of IPTPS 2002.

You can compute the durability of a system based on only a factor of four in storage cost for each data item. We have graphed the probability of losing a block versus the repair time in months. The top line is the case of four replicas, and each successive line increases the total number of fragments while keeping the storage cost the same. Note that the y-axis is logarithmic: as you increase the number of fragments, your chance of losing a block goes down exponentially. The key liability of fragments is that each fragment must be either whole or completely erased. The algorithm cannot handle corrupt fragments, because you might have to attempt n-choose-m combinations to reconstruct the data correctly. Also, since the infrastructure is storing the fragments, you need some way to verify the fragments returned.
The graph shows the resulting Fraction of Blocks Lost Per Year (FBLPY) for a variety of epoch lengths and numbers of fragments: for a given frequency of repair and number of fragments, it shows the fraction of all encoded blocks that are lost. To generate this graph, we construct the probability that a fragment survives until the next epoch from the distribution of disk lifetimes in [26], using a method similar to computing the residual average lifetime of a randomly selected disk [6]. The probability that a block can be reconstructed is the sum of all cases in which at least m fragments survive. The FBLPY is the inverse of the mean of the geometric distribution produced from the block survival probability (i.e., 1/MTTF of a block). Assumptions: the archive consists of a collection of independently failing disks, and failed disks are immediately replaced by new, blank ones. During dissemination, each archival fragment for a given block is placed on a unique, randomly selected disk. We assume the global sweep-and-repair process described above.
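A minimal sketch of the survival arithmetic behind these curves: a block is reconstructible as long as at least m of its n independently stored fragments survive an epoch. The per-fragment survival probability used below is an illustrative value, not one derived from the disk-lifetime data in the paper.

    from math import comb

    def block_survival(p: float, m: int, n: int) -> float:
        """P(at least m of n independently failing fragments survive),
        assuming each fragment survives an epoch with probability p."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

    # Same storage cost (rate 1/4), very different per-epoch loss probability:
    print(1 - block_survival(0.9, 1, 4))    # 4 whole replicas: ~1e-4
    print(1 - block_survival(0.9, 16, 64))  # 16-of-64 encoding: vanishingly small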

Naming and Verification Algorithm
Use a cryptographically secure hash algorithm to detect corrupted fragments.
Verification tree: with n fragments, store log(n) + 1 hashes with each fragment, for a total of n(log(n) + 1) hashes.
The top hash is the block GUID (B-GUID).
Fragments and blocks are self-verifying.

Figure: a verification tree over four encoded fragments F1 through F4 and the data. H1 through H4 are the hashes of the fragments, H12 combines H1 and H2, H34 combines H3 and H4, H14 combines H12 and H34, Hd is the hash of the data, and the B-GUID is the hash over H14 and Hd. Each fragment is stored with the hashes needed to verify it against the B-GUID:
Fragment 1: H2, H34, Hd, F1 (fragment data)
Fragment 2: H1, H34, Hd, F2 (fragment data)
Fragment 3: H4, H12, Hd, F3 (fragment data)
Fragment 4: H3, H12, Hd, F4 (fragment data)
Data: H14, data
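A sketch of the verification step for this scheme: recompute the hash path from a fragment up to the B-GUID using the log(n) + 1 sibling hashes stored with it. The concrete hash function and the left/right ordering are assumptions for illustration, not the exact OceanStore format.

    import hashlib

    def h(*parts: bytes) -> bytes:
        return hashlib.sha1(b"".join(parts)).digest()

    def verify_fragment(fragment: bytes, index: int, siblings: list, b_guid: bytes) -> bool:
        """siblings are the hashes shipped with the fragment, ordered leaf-to-root
        (e.g. H2, H34, Hd for fragment 1); index is the fragment's position."""
        node = h(fragment)
        for level, sibling in enumerate(siblings):
            # the index bit at this level says whether we are the right child
            if (index >> level) & 1:
                node = h(sibling, node)
            else:
                node = h(node, sibling)
        return node == b_guid

A corrupted fragment changes its leaf hash, so the recomputed root no longer matches the B-GUID and the corruption is detected before any reconstruction is attempted.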

Enabling Technology
Here we have a person who wants to listen to this MP3: they submit a GUID to the infrastructure and get back fragments that can be verified and used to reconstruct the block.

Complex Objects I
Figure: a single data block d is the unit of coding; its encoded fragments, named by the GUID of d through the verification tree, are the unit of archival storage.
So far we have only shown how to make one block verifiable. Next we show how to make an object of arbitrary size verifiable.

Complex Objects II
Figure: an object is a B-tree of blocks rooted at a VGUID, with indirect blocks pointing to data blocks d1 through d9. Each block is the unit of coding; its encoded fragments, named by the GUID of the block (e.g. the GUID of d1) through the verification tree, are the unit of archival storage.
Now we show how to make more complex objects of arbitrary size verifiable. By asking the DOLR to reconstruct the top block, you can reach every block in the picture. If we were a library, we would stop there: each version is read-only.

Complex Objects III
AGUID = hash{name + keys}.
Storage is free/cheap: inherently keep all versions.
Figure: two versions, VGUID_i and VGUID_i+1. The newer version's B-tree carries a backpointer to the previous version and shares the unchanged data blocks (d1 through d7) via copy-on-write; only the modified blocks (d'8, d'9) are new.
We can create a versioning file system where each version is tied to one AGUID (i.e., hash{name + keys}). By adding a backpointer and copy-on-write, we get a trail of versions, with the AGUID tying the VGUIDs together. The pointer to the head of the data is tricky; this limitation motivates the need for heartbeats (a signed map from AGUID to VGUID). Some way of verifying the integrity of the head pointer is required for any mutable system such as a file system.
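A small sketch of the copy-on-write step (the data structures and naming here are illustrative, not the prototype's): a new version copies only the modified blocks, reuses the GUIDs of unchanged ones, and records a backpointer to the previous VGUID.

    import hashlib
    from dataclasses import dataclass
    from typing import Optional

    def guid(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    @dataclass(frozen=True)
    class Version:
        vguid: str
        block_guids: tuple          # ordered GUIDs of the data blocks
        backpointer: Optional[str]  # VGUID of the previous version, if any

    def new_version(prev: Version, store: dict, changes: dict) -> Version:
        """changes maps block index -> new block contents."""
        blocks = list(prev.block_guids)
        for i, data in changes.items():
            g = guid(data)
            store[g] = data          # only the modified blocks are written
            blocks[i] = g
        vguid = guid("".join(blocks).encode() + prev.vguid.encode())
        return Version(vguid, tuple(blocks), prev.vguid)

Since unchanged blocks keep their GUIDs, their archived fragments never need to be regenerated; in the figure, only d'8 and d'9 produce new archival work.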

Mutable Data
A real system needs mutable data: an entity in the network maintains the AGUID-to-VGUID mapping.
Byzantine commitment for integrity: verifies client privileges, creates a serial order, and atomically applies each update.
Versioning system: each version is inherently read-only.
End result: complex objects with mutability, as a trail of versions tied to one AGUID. The pointer to the head of the data is tricky; this limitation motivates heartbeats (a signed map from AGUID to VGUID).
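A rough sketch of the heartbeat idea: the only mutable state is a small, signed mapping from an AGUID to its latest VGUID, which a client can check before trusting a head pointer. An HMAC stands in for the inner ring's signature here, purely for illustration.

    import hashlib
    import hmac

    def make_heartbeat(aguid: str, vguid: str, seq: int, key: bytes) -> dict:
        msg = f"{aguid}:{vguid}:{seq}".encode()
        return {"aguid": aguid, "vguid": vguid, "seq": seq,
                "sig": hmac.new(key, msg, hashlib.sha256).hexdigest()}

    def verify_heartbeat(hb: dict, key: bytes) -> bool:
        msg = f"{hb['aguid']}:{hb['vguid']}:{hb['seq']}".encode()
        expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(hb["sig"], expected)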

Archiver I: Archiver Server Architecture
Requests to archive objects are received through the network layers; the consistency mechanism decides when to archive an object.
Figure: the server architecture layers asynchronous disk and network I/O over the operating system and Java virtual machine, with a thread scheduler and dispatch feeding stages for consistency, location and routing, the archiver, and introspection modules.
We have built this, using the principles of event-driven programming.

Archiver Control Flow
Figure: an update request reaches the consistency stage(s), which issue GenerateFragsReq and GenerateFragsChkptReq events to the GenerateFrags and GenerateChkpt stages. Their GenerateFragsResp and GenerateFragsChkptResp results lead to a DisseminateFragsReq for the Disseminator stage, which sends the fragments to the storage servers and produces a DisseminateFragsResp.
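A toy sketch of this staged, event-driven flow. The stage and event names follow the slide; the queues, the stub encoder, and the stub sender are assumptions for illustration only.

    import queue
    import threading
    import time

    archive_q = queue.Queue()      # carries GenerateFragsReq events
    disseminate_q = queue.Queue()  # carries DisseminateFragsReq events

    def erasure_encode(block: bytes, m: int = 16, n: int = 32) -> list:
        return [block[i::n] for i in range(n)]  # stand-in for a real m-of-n encoder

    def send_to_storage_server(frag: bytes) -> None:
        pass  # stand-in for handing the fragment to the DOLR / network layer

    def generate_frags_stage():
        while True:
            req = archive_q.get()                    # GenerateFragsReq
            frags = erasure_encode(req["block"])
            disseminate_q.put({"frags": frags})      # DisseminateFragsReq

    def disseminator_stage():
        while True:
            req = disseminate_q.get()
            for frag in req["frags"]:
                send_to_storage_server(frag)         # send frags to storage servers

    for stage in (generate_frags_stage, disseminator_stage):
        threading.Thread(target=stage, daemon=True).start()
    archive_q.put({"block": b"example block handed down by the consistency stage"})
    time.sleep(0.1)  # give the toy pipeline a moment to drain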

Archival Modes
No archive, delayed (batched) archive, and inlined archive; the performance measurements that follow compare these three modes.

Performance
Performance of the archival layer: the performance of an OceanStore server in archiving objects. We analyze the operations involved in archiving data (including signing updates in a BFT protocol), with m = 16, n = 32.
Experiment environment: OceanStore servers were analyzed on a 42-node cluster. Each machine is an IBM xSeries 330 1U rackmount PC with two 1.0 GHz Pentium III CPUs, 1.5 GB of ECC PC133 SDRAM, and two 36 GB IBM UltraStar 36LZX hard drives. Each machine uses a single Intel PRO/1000 XF gigabit Ethernet adaptor to connect to a Packet Engines gigabit switch, and runs a Linux 2.4.17 SMP kernel. The two disks run in software RAID 0 (striping) mode using md (raidtools-0.90). BerkeleyDB is used when reading and writing blocks to disk. The workload simulates a number of independent event streams, each reading or writing.

Performance: Throughput I
Data throughput:
No archive: 5 MB/s.
Delayed: 3 MB/s.
Inlined: 2.5 MB/s.

Performance: Throughput II
Operation throughput (small writes):
No archive: 55 ops/s.
Delayed: 8 ops/s.
Inlined: 50 ops/s.

Performance: Latency I
Archive only: y-intercept 3 ms, slope 0.3 s/MB.
Inlined archive = no archive + archive only.
(The measured slope in ms/kB converts one-to-one to s/MB.)
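A back-of-the-envelope reading of this fit (the no-archive latency below is a placeholder, not a measured number): archival work alone costs about 3 ms plus 0.3 s per MB, and the inlined write latency is the no-archive latency plus that cost.

    def archive_latency_ms(size_mb: float) -> float:
        return 3.0 + 300.0 * size_mb          # 0.3 s/MB = 300 ms/MB

    def inlined_latency_ms(no_archive_ms: float, size_mb: float) -> float:
        return no_archive_ms + archive_latency_ms(size_mb)

    print(archive_latency_ms(1.0))            # ~303 ms of archival work for a 1 MB write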

Performance: Latency II
No archive: low variance.
Delayed archive: high variance.

Performance: Latency III
Inlined archive: low variance.
Delayed archive: high variance.

Performance: Read Latency
Low median, high variance: a queuing effect.

Future Directions
Caching for performance.
Automatic replica placement.
Automatic repair.
Low failure correlation dissemination.

Caching
Automatic replica placement: replicas are soft state and can be constructed and destroyed as necessary.
Prefetching: reconstruct replicas from fragments in advance of use.
Performance vs. durability.

Efficient Repair
Global: global sweep and repair is not efficient. We want detection and notification of node removal from the system, and a global sweep is not as effective as distributed mechanisms.
Distributed: exploit the DOLR's distributed information and locality properties for efficient detection, and then reconstruction, of fragments.
Figure: node IDs in the DOLR exchanging heartbeats around the ring of level-one (L1) neighbors.
The system should automatically adapt to failure, repair itself, and incorporate new elements. Can we guarantee data is available for 1000 years? New servers are added from time to time, old servers are removed from time to time, and everything just works. With many components and geographic separation, the system is not disabled by natural disasters, can adapt to changes in demand and regional outages, and gains stability through statistics.

Low Failure Correlation Dissemination
Model builder: models failure correlation from various sources.
Set creator: queries random nodes to build dissemination sets, i.e. sets of storage servers that fail with low correlation.
Disseminator: sends fragments to the members of a set.
Figure: an introspection pipeline in which the model builder (fed by human input and network monitoring) supplies a model to the set creator, which probes candidate nodes and hands a dissemination set to the disseminator, which sends fragments to the storage servers.
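A sketch of the set-creation step under stated assumptions: given pairwise failure-correlation estimates from the model builder, greedily pick storage servers whose correlation with every server already chosen stays below a threshold.

    def create_dissemination_set(candidates, correlation, n_frags, max_corr=0.1):
        """candidates: iterable of server ids (e.g. from random probes).
        correlation: dict mapping frozenset({a, b}) -> estimated failure correlation."""
        chosen = []
        for server in candidates:
            if all(correlation.get(frozenset({server, other}), 1.0) <= max_corr
                   for other in chosen):
                chosen.append(server)
            if len(chosen) == n_frags:
                return chosen
        raise ValueError("not enough low-correlation servers for all fragments")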

Conclusion
A storage-efficient, self-verifying mechanism: erasure codes are good.
Performance is good, with improvements forthcoming.
Self-verifying data assists in secure read-only data, secure caching infrastructures, and continuous adaptation and repair.
For more information: http://oceanstore.cs.berkeley.edu/