StarFish: highly-available block storage. Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, Elizabeth Shriver. 2003.

Presentation transcript:

StarFish: highly-available block storage. Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, Elizabeth Shriver. 2003 USENIX Annual Technical Conference. Presenter: D 林敬棋

Introduction Important data needs to be protected. ◦ Make replicas. Replication on remote sites ◦ Reduces the amount of data lost in a failure. ◦ Decreases the time required to recover from a catastrophic site failure.

StarFish A highly-available, geographically-dispersed block storage system. ◦ Does not require expensive dedicated communication lines to all replicas to achieve high availability. ◦ Achieves good performance even during recovery from a replica failure. ◦ Provides single-owner access semantics.

Architecture StarFish consists of ◦ One Host Element (HE)  Provides storage virtualization and a read cache. ◦ N Storage Elements (SEs)  Q: write quorum size.  Synchronous updates go to a quorum of Q SEs; asynchronous updates go to the rest.
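A minimal sketch of this update path, assuming hypothetical StorageElement objects with a blocking write() method; all names and the threading scheme are illustrative, not from the paper:

```python
# Sketch of the HE update path: block the caller until a quorum of Q SEs
# acknowledges the write; the remaining SEs finish asynchronously.
# All names here are illustrative; the paper does not publish this interface.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class HostElement:
    def __init__(self, storage_elements, quorum_size):
        self.ses = storage_elements   # the N SEs
        self.q = quorum_size          # write quorum size Q
        self.pool = ThreadPoolExecutor(max_workers=len(storage_elements))

    def write(self, block_id, data):
        # Issue the write to all N SEs in parallel.
        futures = [self.pool.submit(se.write, block_id, data) for se in self.ses]
        acked, pending = 0, set(futures)
        # Return to the client as soon as Q SEs have acknowledged;
        # writes to the remaining SEs complete in the background.
        while acked < self.q:
            if not pending:
                raise IOError("fewer than Q SEs acknowledged the write")
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            acked += sum(1 for f in done if f.exception() is None)
        return True
```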

Recommended Setup N = 3, Q = 2. MAN: Metropolitan Area Network. WAN: Wide Area Network.

Another Deployment

SE Recovery Write log ◦ HE keeps a circular buffer of recent writes. ◦ Each SE maintains a circular buffer of recent writes on a log disk. Three types of recovery ◦ Quick recovery ◦ Replay recovery ◦ Full recovery
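A minimal sketch of such a circular write log; the capacity, entry format, and method names are assumptions for illustration:

```python
from collections import deque

class WriteLog:
    """Fixed-capacity circular buffer of recent writes, as kept by the HE
    and by each SE. A recovering SE replays the entries it missed; if the
    oldest entry it needs has already been overwritten, replay recovery is
    not possible and a full recovery (copying all blocks) is required."""
    def __init__(self, capacity=1024):
        self.entries = deque(maxlen=capacity)  # old entries drop off automatically
        self.next_seq = 0

    def append(self, block_id, data):
        self.entries.append((self.next_seq, block_id, data))
        self.next_seq += 1

    def replay_from(self, seq):
        # Return the writes at or after `seq`, or None if the log has
        # wrapped past `seq` (i.e., full recovery is needed).
        missed = [e for e in self.entries if e[0] >= seq]
        if missed and missed[0][0] != seq:
            return None   # gap: oldest needed entry was overwritten
        if not missed and seq < self.next_seq:
            return None   # everything needed has been overwritten
        return missed
```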

Availability and Reliability Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second. Similarly, the HE has Poisson-distributed rates λ_he and μ_he.

Availability The steady-state probability that at least Q SEs are available. Derived from the standard machine repairman model.
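Under the i.i.d. assumption each SE behaves as an independent two-state (up/down) Markov process, which yields the textbook machine-repairman form below, with ρ = λ/μ; this is the standard expression the slide implies, and the paper's exact formula may differ in detail:

```latex
a = \frac{\mu}{\lambda + \mu} = \frac{1}{1 + \rho}, \qquad \rho = \frac{\lambda}{\mu}
% a: steady-state availability of a single SE

A(N, Q) = \sum_{i=Q}^{N} \binom{N}{i}\, a^{i} (1 - a)^{N - i}
% A(N, Q): probability that at least Q of the N SEs are available
```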

Machine Repairman Model

Availability (cont.)

Availability (cont.)  X★9: the number of 9s in an availability measure. Availability is much higher when N = 2Q + 1. For a fixed N, availability decreases with larger quorum size. ◦ Increasing the quorum size trades availability for reliability.
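To make the tradeoff concrete, a short sketch evaluating the quorum-availability sum above for N = 3; the 99% per-SE availability is an assumed input (it matches the figure used in the conclusion):

```python
from math import comb

def availability(n, q, a):
    """Probability that at least q of n i.i.d. SEs (each up with prob. a) are up."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(q, n + 1))

a = 0.99
for q in (1, 2, 3):
    print(f"N=3, Q={q}: {availability(3, q, a):.6f}")
# N=3, Q=1: 0.999999   (any one SE suffices)
# N=3, Q=2: 0.999702   (quorum of two)
# N=3, Q=3: 0.970299   (all three must be up)
```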

Reliability The probability of no data loss. Reliability increases with larger Q. Two approaches ◦ Make Q > ⌊N/2⌋ and require at least Q SEs to be available.  Reduces availability and performance. ◦ Read-only consistency

Read-only Consistency The system stays available in read-only mode during failures. ◦ Read-only mode obviates the need for Q SEs to be available to handle updates. ◦ Increases availability.
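Ignoring HE failures for the moment (the ρ_he terms on the next slide refine this), read-only consistency keeps the data readable whenever any single SE is up, which gives a simple lower bound; this bound is an illustration, not a formula from the paper:

```latex
A_{\text{ro}} \;\ge\; 1 - (1 - a)^{N},
\qquad\text{e.g. } a = 0.99,\ N = 3:\quad 1 - (0.01)^{3} = 0.999999
```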

Availability with Read-only Consistency

Observations If ρ_he = 0, availability is independent of Q. ◦ The system can always recover from the HE. If ρ_he increases, availability increases with Q. The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1. ◦ Diminishing gains after Q = 2. ◦ Suggests Q = 2 in practical systems.

Availability with Read-only Consistency (cont.) N < 2Q

Implementation

Performance Measurements Compared with a direct-attached RAID unit.

Settings Different network delays ◦ 1, 2, 4, 8, 23, 36, 65 ms Different bandwidth limitations ◦ 31, 51, 62, 93, 124 Mb/s Benchmarks ◦ Micro-benchmarks  Read hit  Read miss  Write ◦ PostMark

Effects of network delays and HE cache size  Near SE delay: 4 ms; far SE delay: 8 ms.  No cache misses if the HE cache size = 400 MB.

Observation A large HE cache improves performance. ◦ The HE can respond to more read requests without communicating with an SE.  Does not change write requests. ◦ Especially beneficial when the local SE has significant delay. With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE. ◦ It depends on the near SE instead.
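A sketch of the read path this observation implies; the class and its crude FIFO eviction are illustrative assumptions, not the paper's cache design:

```python
class HostElementReadPath:
    def __init__(self, near_se, cache_blocks):
        self.near_se = near_se    # reads that miss go to the nearest SE
        self.cache = {}           # block_id -> data (the HE read cache)
        self.capacity = cache_blocks

    def read(self, block_id):
        # Cache hit: served locally, so SE delays do not matter.
        if block_id in self.cache:
            return self.cache[block_id]
        # Cache miss: pay one round trip to the near SE only.
        data = self.near_se.read(block_id)
        if len(self.cache) >= self.capacity:
            self.cache.pop(next(iter(self.cache)))  # crude FIFO eviction
        self.cache[block_id] = data
        return data
```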

Normal Operation and placement of the far SE  1-8: 1, 2, 4, 8 ms; 4-12: 4, 8, 12 ms  23-65: 23, 36, 65 ms; 31-124: 31, 51, 62, 93, 124 Mbps  Local SE delay: 0 ms. N = 3

Normal Operation and placement of the far SE (cont.) N = 3, 8 threads

Normal Operation and placement of the far SE (cont.)

Observation Performance is influenced mostly by two parameters ◦ Write quorum size ◦ Delay to the SEs StarFish can provide adequate performance when one of the SEs is placed in a remote location. ◦ At least 85% of the performance of a direct-attached RAID.

Recovery Performance degrades more during full recovery than during quick or replay recovery.

Conclusion The StarFish system reveals significant benefits from a third copy of the data at an intermediate distance. A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than 99.9999% availability, assuming individual Storage Element availability of 99%.