Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

Storage: do not lose data
Past: fsck, IRON, ZFS, …
Now & future: scalable distributed storage systems

Salus: provide robustness to scalable systems
Scalable systems: tolerate crash failures and certain disk corruptions (GFS/Bigtable, HDFS/HBase, WAS, …)
Strong protections: tolerate arbitrary failures (BFT, end-to-end verification, …)
Salus aims to combine the two: a scalable design with strong protections.

Start from “Scalable”
(Figure: file system, key-value, database, and block store systems arranged along scalable vs. robust axes.)
Borrow scalable designs.

Scalable architecture
(Figure: clients, a metadata server, and data servers.)
– Parallel data transfer between clients and data servers.
– Data is replicated for durability and availability.
– Computing nodes: no persistent storage. Storage nodes hold the data.

Problems?
No ordering guarantees for writes.
Single points of failure:
– Computing nodes can corrupt data.
– A client can accept corrupted data.

1. No ordering guarantees for writes
(Figure: clients, metadata server, data servers.)
A block store requires barrier semantics.

2. Computing nodes can corrupt data
(Figure: clients, metadata server, data servers.)
Each computing node is a single point of failure.

3. A client can accept corrupted data
(Figure: clients, metadata server, data servers.)
The node serving the read is a single point of failure.
Does a simple disk checksum work? No: it cannot prevent errors introduced in memory, etc.
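
To make that limitation concrete, here is a minimal sketch (hypothetical helpers, not Salus or HDFS code) of a write path that only checksums data when it reaches the disk layer. A bit flip that happens in memory before that point gets checksummed along with the data, so the read-time check passes:

```python
import hashlib

def disk_write(block_id, data, store):
    # The checksum covers whatever bytes are in memory at write time.
    checksum = hashlib.sha256(data).hexdigest()
    store[block_id] = (data, checksum)

def disk_read(block_id, store):
    data, checksum = store[block_id]
    # Detects corruption that happened after the checksum was computed
    # (i.e., on disk), but not corruption that happened earlier in memory.
    assert hashlib.sha256(data).hexdigest() == checksum
    return data

store = {}
buf = bytearray(b"client payload")
buf[0] ^= 0xFF                       # bit flip in memory, before checksumming
disk_write(7, bytes(buf), store)
print(disk_read(7, store))           # check passes; corrupted data is returned
```

This is why Salus pushes verification all the way to the client instead of relying on per-disk checksums alone.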

Salus: overview
(Figure: metadata server and block drivers, a single driver per virtual disk.)
Key techniques: pipelined commit, active storage, end-to-end verification.
Failure model: almost arbitrary failures, but no impersonation.

Pipelined commit
Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it.
Naïve solution: wait at each barrier, which loses parallelism.
This looks similar to a distributed transaction; the well-known solution is two-phase commit (2PC).
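
For reference, a minimal two-phase commit sketch (illustrative Python with hypothetical class and method names, not Salus's actual protocol): the leader first asks every server to prepare, i.e. durably log, the batch, and commits only once all of them have voted yes.

```python
class Participant:
    """Stand-in for one data server."""
    def __init__(self):
        self.prepared, self.committed = [], []

    def prepare(self, batch):
        self.prepared.append(batch)   # durably log the batch
        return True                   # vote yes

    def commit(self, batch):
        self.committed.append(batch)  # make the batch visible

    def abort(self, batch):
        if batch in self.prepared:
            self.prepared.remove(batch)

def two_phase_commit(participants, batch):
    votes = [p.prepare(batch) for p in participants]   # phase 1: prepare
    if all(votes):
        for p in participants:                         # phase 2: commit
            p.commit(batch)
        return True
    for p in participants:
        p.abort(batch)
    return False

servers = [Participant() for _ in range(3)]
assert two_phase_commit(servers, ["write A", "write B"])
```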

Problem of 2PC
(Figure: the driver sends batch i to the servers; each batch has a leader, the servers reply “prepared”, and barriers separate batches.)

Problem of 2PC
(Figure: the leader commits batch i on the servers.)
The problem can still happen: we also need an ordering guarantee between different batches.

Pipelined commit
(Figure: the driver transfers batches i and i+1 to the servers in parallel; the leader of batch i commits only after “batch i-1 committed”, and batch i+1 commits only after “batch i committed”.)
Parallel data transfer, sequential commit.
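
A minimal sketch of the pipelined variant (reusing the hypothetical Participant class from the 2PC sketch above): prepares for all batches run in parallel, while commits are chained so that batch i commits only after batch i-1 has committed; if a batch fails to prepare, it and everything after it are discarded.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_commit(participants, batches):
    committed = []
    with ThreadPoolExecutor() as pool:
        # Phase 1 for every batch runs concurrently (parallel data transfer).
        prepared = [
            pool.submit(lambda b=b: all(p.prepare(b) for p in participants))
            for b in batches
        ]
        # Phase 2 is strictly sequential: batch i waits for batch i-1.
        for batch, fut in zip(batches, prepared):
            if not fut.result():
                break                          # a hole: drop this and later batches
            for p in participants:
                p.commit(batch)
            committed.append(batch)
    return committed

servers = [Participant() for _ in range(3)]
print(pipelined_commit(servers, [["w1"], ["w2", "barrier"], ["w3"]]))
```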

Outline
Challenges
Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
Evaluation

Active storage
Goal: a single node cannot corrupt data.
Well-known solution: BFT replication
– Problem: requires at least 2f+1 replicas.
Salus’ goal: use f+1 computing nodes
– f+1 is enough to guarantee safety.
– What about availability/liveness?

Active storage: unanimous consent
(Figure: a single computing node writing to its storage nodes.)

Active storage: unanimous consent
(Figure: f+1 computing nodes, each colocated with a storage node.)
Unanimous consent:
– All updates must be agreed to by all f+1 computing nodes.
Additional benefit:
– Colocating computing and storage saves network bandwidth.
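
A minimal sketch of the unanimous-consent check (hypothetical data layout, not Salus's actual certificate format): a storage node applies an update only when all f+1 computing nodes have certified the same bytes.

```python
import hashlib

F = 1                                              # tolerated faulty computing nodes
COMPUTING_NODES = {f"c{i}" for i in range(F + 1)}  # the f+1 computing nodes

def certificate(node_id, update):
    # Stand-in for a signed statement "node_id vouches for this update".
    return (node_id, hashlib.sha256(update).hexdigest())

def apply_if_unanimous(update, certs, log):
    digest = hashlib.sha256(update).hexdigest()
    signers = {node for (node, d) in certs if d == digest}
    if signers >= COMPUTING_NODES:                 # every computing node agrees
        log.append(update)
        return True
    return False                                   # disagreement: reject, trigger replacement

log = []
update = b"write block 42"
certs = [certificate(n, update) for n in COMPUTING_NODES]
assert apply_if_unanimous(update, certs, log)
assert not apply_if_unanimous(b"tampered", certs, log)
```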

Active storage: restore availability
(Figure: computing nodes and storage nodes.)
What if something goes wrong?
– Problem: we may not know which computing node is faulty.
– Solution: replace all of the computing nodes; this is feasible because computing nodes hold no persistent state.

Discussion
Does Salus achieve BFT with f+1 replication?
– No. A fork can happen under a combination of failures.
Is it worth using 2f+1 replicas to eliminate this case?
– No. Forks can still happen as a result of other events (e.g., geo-replication).

Outline
Challenges
Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
Evaluation

Evaluation
– Is Salus robust against failures?
– Do the additional mechanisms hurt scalability?

Is Salus robust against failures?
Always ensures barrier semantics
– Experiment: crash the block driver.
– Result: requests after “holes” are discarded.
A single computing node cannot corrupt data
– Experiment: inject errors into a computing node.
– Result: the storage nodes reject it and trigger replacement.
Block drivers never accept corrupted data
– Experiment: inject errors into a computing/storage node.
– Result: the block driver detects the errors and retries other nodes.

Do the additional mechanisms hurt scalability?

Conclusion
Pipelined commit
– Writes proceed in parallel.
– Ordering guarantees are provided.
Active storage
– Eliminates single points of failure.
– Consumes less network bandwidth.
High robustness ≠ low performance.

Salus: scalable end-to-end verification
Goal: read safely from one node
– The client should be able to verify the reply.
– If the reply is corrupted, the client retries another node.
Well-known solution: Merkle tree
– Problem: scalability.
Salus’ solution:
– Single writer.
– Distribute the tree among the servers.

Scalable end-to-end verification
(Figure: the Merkle tree is partitioned across servers 1-4; the client keeps only the top of the tree.)
– The client maintains the top tree.
– The client does not need to store anything persistently: it can rebuild the top tree from the servers.
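
A minimal sketch of the idea (a simplified binary Merkle tree with hypothetical helpers; the actual Salus layout differs): each server keeps the Merkle subtree for its blocks, the client keeps only the per-server subtree roots (its "top tree"), and every read reply is verified against them.

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def subtree_root(blocks):
    """Root hash of one server's blocks (simple binary Merkle tree)."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])            # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(blocks, idx):
    """Sibling hashes needed to recompute the subtree root for blocks[idx]."""
    level, path = [h(b) for b in blocks], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append(level[idx ^ 1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify(block, idx, path, root):
    digest = h(block)
    for sibling in path:
        digest = h(digest + sibling) if idx % 2 == 0 else h(sibling + digest)
        idx //= 2
    return digest == root

# The client's "top tree": just the subtree root it tracks for each server.
server_blocks = [b"blk0", b"blk1", b"blk2", b"blk3"]
client_top = {"server1": subtree_root(server_blocks)}

reply, proof = server_blocks[2], merkle_proof(server_blocks, 2)
assert verify(reply, 2, proof, client_top["server1"])   # retry another node if False
```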

Scalable and robust storage
More hardware → more complex software → more failures.

Start from “Scalable”
(Figure: the scalable vs. robust axes, revisited.)

Pipelined commit: the challenge
Is 2PC slow?
– Locking? There is a single writer, so no locking.
– Phase 2 network overhead? Small compared to phase 1.
– Phase 2 disk overhead? Eliminated, pushing the complexity into recovery.
Ordering across transactions
– Easy when failure-free.
– Complicates recovery.
Key challenge: recovery.

Scalable architecture
Computation node: no persistent storage; handles data forwarding, garbage collection, etc.
Examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), …
(Figure: a computation node in front of storage nodes.)

Salus: overview
Block store interface (as in Amazon EBS):
– Volume: a fixed number of fixed-size blocks.
– A single block driver per volume issues requests with barrier semantics.
Failure model: arbitrary failures, but no identity forging.
Key ideas:
– Pipelined commit: write in parallel and in order.
– Active storage: replicate computation nodes.
– Scalable end-to-end checks: the client verifies replies.
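
A minimal sketch of the interface described above (hypothetical class and method names; Amazon EBS itself exposes volumes as raw block devices rather than this API): a volume is a fixed array of fixed-size blocks, and the single driver issues writes in order, optionally marked as barriers.

```python
class Volume:
    """A fixed number of fixed-size blocks."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks

class BlockDriver:
    """Single driver per volume; it is the volume's only writer."""
    def __init__(self, volume: Volume):
        self.volume, self.pending = volume, []

    def write(self, block_id: int, data: bytes, barrier: bool = False):
        assert len(data) == self.volume.block_size
        # A barrier request must not take effect before any earlier request.
        self.pending.append((block_id, data, barrier))

    def read(self, block_id: int) -> bytes:
        return self.volume.blocks[block_id]

    def flush(self):
        for block_id, data, _barrier in self.pending:   # applied in issue order
            self.volume.blocks[block_id] = data
        self.pending.clear()
```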

Complex recovery
– Must still guarantee barrier semantics.
– What if servers fail concurrently?
– What if different replicas diverge?

Supporting concurrent users
No perfect solution today:
– Give up linearizability: causal or fork-join-causal consistency (Depot).
– Give up scalability: SUNDR.

How to handle storage node failures?
– Computing nodes generate certificates when writing to storage nodes.
– If a storage node fails, the computing nodes agree on the current state.
– Approach: find the longest valid prefix, then re-replicate the data.
– As long as one correct storage node is available, no data is lost.
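
A minimal sketch of the prefix agreement (hypothetical log representation, not Salus's full recovery protocol): each surviving replica reports its log of certified updates, and recovery adopts the longest contiguous prefix whose entries all carry valid certificates, then re-replicates from it.

```python
def longest_valid_prefix(replica_logs, is_valid_cert):
    """replica_logs: list of logs; each log is a list of (seq, cert, update)."""
    best = []
    for log in replica_logs:
        prefix = []
        for seq, cert, update in log:
            if seq != len(prefix) or not is_valid_cert(cert, update):
                break                      # hole or invalid certificate: stop here
            prefix.append((seq, cert, update))
        if len(prefix) > len(best):
            best = prefix
    return best                            # re-replicate starting from this prefix

logs = [
    [(0, "ok", b"w0"), (1, "ok", b"w1")],
    [(0, "ok", b"w0"), (1, "bad", b"w1"), (2, "ok", b"w2")],
]
state = longest_valid_prefix(logs, lambda cert, update: cert == "ok")
assert [entry[0] for entry in state] == [0, 1]
```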

Why barrier semantics?
Goal: provide atomicity.
Example (file system): appending to a file requires modifying a data block and the corresponding pointer to that block. It is bad if the pointer is modified but the data is not.
Solution: write the data first, then the pointer, and attach a barrier to the pointer write (sketched below).
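
A minimal sketch of that pattern, reusing the hypothetical Volume and BlockDriver from the interface sketch above: the data block is written first, and the pointer write carries a barrier so it can never take effect before the data it points to.

```python
def append_to_file(driver, data_block_id, pointer_block_id, payload, pointer):
    driver.write(data_block_id, payload)                    # data first
    driver.write(pointer_block_id, pointer, barrier=True)   # then the pointer, with a barrier
    driver.flush()

vol = Volume(num_blocks=8, block_size=16)
drv = BlockDriver(vol)
append_to_file(drv, 3, 0, b"new data".ljust(16, b"\0"), b"ptr->3".ljust(16, b"\0"))
assert drv.read(3).startswith(b"new data")
```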

Cost of robustness
(Figure: robustness plotted against cost. Points, in order of increasing robustness and cost: crash tolerance, Salus, eliminating forks (requires 2f+1 replication), and fully BFT (requires signatures and N-way programming).)