Robustness in the Salus scalable block store: Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin

Presentation transcript:

Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

Storage: do not lose data
Past: fsck, IRON, ZFS, …
Now & future: are they robust enough?

Salus: provide robustness to scalable systems
Scalable systems tolerate crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, FDS, …).
Strong protections tolerate arbitrary failures (BFT, end-to-end verification, …).
Salus aims to bring the latter's robustness to the former's architecture.

Start from "Scalable"
(Diagram: file system, key-value store, database, and block store placed on scalable vs. robust axes.)

Scalable architecture
– Clients ↔ metadata server: infrequent metadata transfer
– Clients ↔ data servers: parallel data transfer
– Data is replicated for durability and availability.
Computation nodes: no persistent storage; data forwarding, garbage collection, etc.
Examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), …
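
A minimal Python sketch of this split (class and function names are illustrative, not taken from any of the systems above): the client asks the metadata server only where a block lives, then moves the data to the replicas in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

class MetadataServer:
    def __init__(self, placement):
        self.placement = placement            # block number -> list of data servers

    def locate(self, block_no):
        """Infrequent metadata transfer: only tells the client where the data lives."""
        return self.placement[block_no]

class DataServer:
    def __init__(self, name):
        self.name, self.blocks = name, {}

    def write(self, block_no, data):
        self.blocks[block_no] = data          # each replica stores the block
        return f"{self.name}: stored block {block_no}"

def replicated_write(metadata, block_no, data):
    replicas = metadata.locate(block_no)      # one metadata round trip
    with ThreadPoolExecutor() as pool:        # parallel data transfer to all replicas
        return list(pool.map(lambda s: s.write(block_no, data), replicas))

# Usage: three data servers, three-way replication of one block.
servers = [DataServer(f"ds{i}") for i in range(3)]
meta = MetadataServer({42: servers})
print(replicated_write(meta, 42, b"hello"))
```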

Challenges
– Parallelism vs consistency
– Prevent computation nodes from corrupting data
– Read safely from one node
Do NOT increase replication cost.

Parallelism vs consistency
(Diagram: clients issue Write 1, Write 2, and a barrier toward the data servers.)
A block store requires barrier semantics.

Prevent computation nodes from corrupting data
(Diagram: clients, metadata server, and data servers; the computation node handling a write is a single point of failure for corruption.)

Read safely from one node
(Diagram: clients, metadata server, and data servers; the node serving a read is a single point of failure.)
Does a simple checksum work? No: it cannot prevent errors in control flow.
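
The claim that a simple checksum does not make single-node reads safe can be seen with a small, purely illustrative example: if a buggy or compromised server takes the wrong control path and returns the wrong block, the checksum it stored for that block still verifies.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# The server stores each block together with a checksum it computed itself.
blocks = {7: b"contents of block 7", 8: b"contents of block 8"}
store = {n: (data, checksum(data)) for n, data in blocks.items()}

def buggy_read(block_no: int):
    """A control-flow error: the server looks up the wrong block."""
    return store[block_no + 1]

data, cksum = buggy_read(7)
assert checksum(data) == cksum       # the checksum verifies...
assert data != blocks[7]             # ...yet the client received the wrong block
```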

Outline: Challenges; Salus – pipelined commit, active storage, scalable end-to-end checks; Evaluation

Salus' overview
Disk-like interface (as in Amazon EBS):
– Volume: a fixed number of fixed-size blocks
– Single client per volume, with barrier semantics
Failure model: arbitrary failures, but no identity forging
Key ideas:
– Pipelined commit: write in parallel and in order
– Active storage: safe writes
– Scalable end-to-end checks: safe reads
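
A minimal sketch of this disk-like, single-writer interface (the names and signatures are assumptions for illustration, not Salus' actual API):

```python
from abc import ABC, abstractmethod

class Volume(ABC):
    """A fixed number of fixed-size blocks, written by a single client."""

    @abstractmethod
    def write(self, block_no: int, data: bytes) -> None:
        """Issue a write; outstanding writes may be executed in parallel."""

    @abstractmethod
    def barrier(self) -> None:
        """All writes issued before this call must be applied before any
        write issued after it (the barrier semantics above)."""

    @abstractmethod
    def read(self, block_no: int) -> bytes:
        """Read one block; the client verifies the reply end to end."""
```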

Salus' key ideas
(Diagram: clients and the metadata server, with the data path annotated with pipelined commit, active storage, and end-to-end verification.)

Outline: Challenges; Salus – pipelined commit, active storage, scalable end-to-end checks; Evaluation

Pipelined commit
Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it.
Naïve solution:
– The client blocks at a barrier: parallelism is lost.
This is a weaker version of a distributed transaction:
– Well-known solution: two-phase commit (2PC)

Pipelined commit – 2PC (prepare)
(Diagram: the client sends batch i to its servers; the servers report "prepared" to the leader of batch i, while the previous leader reports "batch i-1 committed".)

Pipelined commit – 2PC (commit)
(Diagram: after "batch i-1 committed" arrives from the previous leader, the leader of batch i sends commit to its servers and then reports "batch i committed" to the leader of batch i+1.)
Parallel data transfer, sequential commit.

Key difference from 2PC
(Diagram: the same message flow, highlighting that the commit of batch i is chained to the "batch i-1 committed" notification rather than decided independently per transaction.)
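
A minimal, single-process sketch of the pipelined-commit idea in the slides above (illustrative names, not Salus' code): batches prepare in parallel, but each batch's leader commits only after it hears that the previous batch committed, so commits stay in order.

```python
import threading

def prepare(batch_no, writes):
    print(f"batch {batch_no}: prepared {len(writes)} writes")

def commit(batch_no):
    print(f"batch {batch_no}: committed")

class BatchLeader:
    def __init__(self, batch_no: int, prev_committed: threading.Event):
        self.batch_no = batch_no
        self.prev_committed = prev_committed   # signaled by batch i-1's leader
        self.committed = threading.Event()     # waited on by batch i+1's leader

    def run(self, writes):
        prepare(self.batch_no, writes)         # phase 1: overlaps with other batches
        self.prev_committed.wait()             # sequential commit: wait for batch i-1
        commit(self.batch_no)                  # phase 2
        self.committed.set()                   # unblock batch i+1

# Usage: three batches run concurrently, yet commit in order 0, 1, 2.
start = threading.Event()
start.set()                                    # there is no batch before batch 0
prev, leaders = start, []
for i in range(3):
    leader = BatchLeader(i, prev)
    leaders.append(leader)
    prev = leader.committed
threads = [threading.Thread(target=l.run, args=([b"w"] * 4,)) for l in leaders]
for t in threads:
    t.start()
for t in threads:
    t.join()
```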

Outline: Challenges; Salus – pipelined commit, active storage, scalable end-to-end checks; Evaluation

Salus: active storage
Goal: a single node cannot corrupt data
Well-known solution: BFT replication
– Problem: higher replication cost
Salus' solution: use f+1 computation nodes
– Require unanimous consent
– How about availability/liveness?

Active storage: unanimous consent
(Diagram: f+1 computation nodes sitting in front of the storage nodes.)
Unanimous consent:
– All updates must be agreed upon by all f+1 computation nodes.
Additional benefit:
– Collocating computation and storage saves network bandwidth.
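
A minimal sketch of the unanimous-consent rule (hypothetical names, with a bare hash standing in for a real signed certificate): an update reaches storage only if all f+1 computation nodes certify the same result.

```python
import hashlib

def certify(update: bytes) -> str:
    """Stand-in for the signed digest one computation node produces for an update."""
    return hashlib.sha256(update).hexdigest()

def apply_if_unanimous(update: bytes, certificates) -> bool:
    """Storage applies the update only when all f+1 certificates match it."""
    expected = hashlib.sha256(update).hexdigest()
    if all(c == expected for c in certificates):
        return True                  # every computation node agreed
    return False                     # disagreement: some node is faulty, do not apply

# Usage with f = 1, i.e. two computation nodes:
update = b"write block 42"
assert apply_if_unanimous(update, [certify(update), certify(update)])
assert not apply_if_unanimous(update, [certify(update), certify(b"corrupted")])
```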

Active storage: restore availability
What if something goes wrong?
– Problem: we may not know which node is faulty.
– Solution: replace all computation nodes.
– Computation nodes have no persistent state, so replacing them all is cheap.

Discussion
Does Salus achieve BFT with f+1 replication?
– No. A fork can happen under a combination of failures.
Is it worth using 2f+1 replicas to eliminate this case?
– No. A fork can still happen as a result of geo-replication.
How to support multiple writers?
– We do not know a perfect solution: give up either linearizability or scalability.

Outline: Challenges; Salus – pipelined commit, active storage, scalable end-to-end checks; Evaluation

Evaluation
(The evaluation graphs are not reproduced in the transcript.)

Outline
– Salus' overview
– Pipelined commit: write in order and in parallel
– Active storage: prevent a single node from corrupting data
– Scalable end-to-end verification: read safely from one node
– Evaluation

Salus: scalable end-to-end verification
Goal: read safely from one node
– The client should be able to verify the reply.
– If the reply is corrupted, the client retries with another node.
Well-known solution: Merkle tree
– Problem: scalability
Salus' solution:
– Single writer
– Distribute the tree among the servers

Scalable end-to-end verification
(Diagram: the Merkle tree is distributed across servers 1–4, each holding the subtree for its own data.)
The client maintains the top of the tree.
The client does not need to store anything persistently: it can rebuild the top tree from the servers.
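
A minimal sketch of the distributed Merkle check (all names are illustrative): each server keeps the Merkle subtree over its own blocks, the client keeps only the top of the tree, here reduced to one trusted subtree root per server, and verifies a single-server reply against it.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def subtree_root(blocks):
    """Root of the Merkle tree one server keeps over its blocks (count assumed a power of two)."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_path(blocks, idx):
    """Sibling hashes from a leaf up to the subtree root; the flag marks a right-hand sibling."""
    level, path = [h(b) for b in blocks], []
    while len(level) > 1:
        sib = idx ^ 1
        path.append((level[sib], sib > idx))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify(block, path, trusted_root):
    """Client-side check of one server's reply against the root the client maintains."""
    node = h(block)
    for sibling, sibling_is_right in path:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == trusted_root

# Usage: the client trusts only the per-server roots it keeps as its top tree.
server_blocks = [b"b0", b"b1", b"b2", b"b3"]
client_top_tree = {"server1": subtree_root(server_blocks)}
reply, proof = server_blocks[2], merkle_path(server_blocks, 2)
assert verify(reply, proof, client_top_tree["server1"])
assert not verify(b"corrupted", proof, client_top_tree["server1"])
```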

Scalable and robust storage
More hardware, more complex software, more failures.

Start from "Scalable"
(Diagram: the scalable vs. robust axes revisited.)

Pipelined commit – challenge
Is 2PC slow?
– Locking? Single writer, so no locking.
– Phase 2 network overhead? Small compared to phase 1.
– Phase 2 disk overhead? Eliminated; the complexity is pushed to recovery.
Ordering across transactions:
– Easy when failure-free.
– Complicates recovery.
Key challenge: recovery