Separating Data and Metadata for Robustness and Scalability. Yang Wang, University of Texas at Austin.

Goal: a better storage system. Data is important, data keeps growing, and data is accessed in different ways.

Challenge: achieve multiple goals simultaneously. Robustness – durable and available despite failures. Scalability – thousands of machines or more. Efficiency – good performance with a reasonable cost.

Solution: separating data and metadata.

My work: Gnothi and Salus (design) and Exalt (evaluation).

My work: Gnothi targets small-scale systems and crash failures.

My work: Salus targets large-scale systems and arbitrary failures.

How to design? Problem: stronger protection -> higher cost. Key observation: data is big (4K to several MB), while metadata is small (tens of bytes) and can validate the data. Solution: strong protection for metadata -> robustness; minimal replication for data -> scalability and efficiency.
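
A minimal sketch of the key observation above, assuming a checksum-based scheme (the field names and the use of SHA-256 are illustrative, not the systems' actual formats): a few tens of bytes of metadata are enough to identify and validate a much larger data block.

    # Illustrative only: small metadata validating a large data block.
    import hashlib

    class BlockMetadata:
        """Tens of bytes per block: enough to identify and validate the data."""
        def __init__(self, block_no, version, checksum):
            self.block_no = block_no   # which block this record describes
            self.version = version     # monotonically increasing write version
            self.checksum = checksum   # digest of the (much larger) data block

    def make_metadata(block_no, version, data):
        return BlockMetadata(block_no, version, hashlib.sha256(data).digest())

    def validate(meta, data):
        # A node holding only metadata can still tell whether a data copy
        # fetched from elsewhere is current and uncorrupted.
        return hashlib.sha256(data).digest() == meta.checksum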

How to evaluate? Exalt: evaluate large-scale storage systems on small to medium platforms.

Outline Gnothi: Efficient and Available Storage Replication – Small scale; tolerate crash faults and timing errors Salus: Robust and Scalable Block Store – Large scale; tolerate arbitrary failures Exalt: Evaluate large-scale storage systems

Resolving a long-standing trade-off. Efficiency – write to f+1 nodes and read from 1 node. Robustness – Availability: aggressive timeout for failure detection; Consistency: reads return the data of the latest write. (The talk compares Synchronous Primary Backup, Asynchronous Replication, and Gnothi against these goals.)

Gnothi Overview. Gnothi resolves the trade-off, but only for block storage, meaning: a fixed number of fixed-size blocks, and a request reads/writes a single block. Key ideas: don't insist that nodes have identical state; a node knows which blocks are fresh/stale. (Gnothi Seauton – know yourself.)

Separating Data and Metadata. Clients send write requests over the LAN to 2f+1 nodes. The metadata for a block is about 24 bytes (while the data block itself is 4K to 1M) and includes the blockNo, client ID, and so on.
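
To make the size claim concrete, here is a hypothetical 24-byte packed layout (blockNo and client ID come from the slide; the third field and the exact encoding are assumptions):

    # Hypothetical wire format for the per-write metadata record.
    import struct

    META_FMT = ">QQQ"   # blockNo, client ID, request sequence number: 24 bytes

    def pack_metadata(block_no, client_id, seq):
        return struct.pack(META_FMT, block_no, client_id, seq)

    record = pack_metadata(block_no=42, client_id=7, seq=1001)
    assert len(record) == 24   # tiny next to the 4K-1M data block it describes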

Rest of Gnothi: Why is the trade-off challenging? How does Gnothi resolve the trade-off? How well does Gnothi perform?

Why is the trade-off challenging? How should a timeout be handled, and can we have both f+1 replication and a short timeout? Synchronous Primary Backup (Remus, HBase, Hypervisor, …) continues with 1 node and therefore must use a conservative timeout. Asynchronous Replication (Paxos, …) sends to 2f+1 nodes and waits for f+1 ACKs.

Why is the trade-off challenging? Continue with 1 node? Not safe. Wait? Not live. Partial replication (Cheap Paxos, ZZ, …) switches to another node on a timeout, but the state of the newly enlisted node may be incomplete. One solution is to copy all data to the new node on a switch, which gives bad availability.

Rest of Gnothi: Why is the trade-off challenging? How does Gnothi resolve the trade-off? How well does Gnothi perform?

Gnothi: nodes can be incomplete. A new write will overwrite the block anyway. Reads can be processed correctly as long as a node knows which blocks are stale, and recovery can be processed correctly as long as a node knows which version of a block is the latest. (In the figure, a node that does not have the current version of block 2 fetches the latest version from a replica that does before serving the read.)
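
A minimal sketch (not the Gnothi implementation) of a replica that tolerates being incomplete, as described above; fetch_from_peer is a hypothetical callback standing in for the real protocol:

    class Replica:
        def __init__(self, num_blocks, block_size=4096):
            self.data = [b"\x00" * block_size] * num_blocks
            self.current = [True] * num_blocks  # per block: "do I have the latest write?"

        def write(self, block_no, payload):
            # A new write overwrites the block, so an incomplete replica
            # becomes current again for this block.
            self.data[block_no] = payload
            self.current[block_no] = True

        def read(self, block_no, fetch_from_peer):
            if not self.current[block_no]:
                # Correctness only requires knowing the block is stale: the
                # latest version can be fetched from a replica that has it.
                self.data[block_no] = fetch_from_peer(block_no)
                self.current[block_no] = True
            return self.data[block_no]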

How does Gnothi work? How to perform writes and reads efficiently when no failures occur? Write to f+1 and read from 1. How to continue processing requests during failures? Still write to f+1 and read from 1. How to recover the failed node efficiently?

How to perform writes and reads efficiently when no failures occur? Maintain a single bit for each block: “do I have the current data?” Data is replicated f+1 times, and the metadata ensures reads can be processed correctly: for a given block, some nodes hold both data and metadata while others hold only metadata. (Related read protocol: Gaios, Bolosky et al., NSDI 2011.)
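
A sketch of the failure-free path under the numbers above (metadata to all 2f+1 nodes, data to only f+1 of them, reads from 1); the send_metadata, send_data, and ask callables are hypothetical RPC stubs, and the placement policy is invented for illustration:

    versions = {}   # per-block write version, for illustration only

    def preferred_nodes(block_no, nodes, k):
        # Pick k of the 2f+1 nodes deterministically from the block number.
        start = block_no % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(k)]

    def write_block(block_no, payload, nodes, f, send_metadata, send_data):
        versions[block_no] = versions.get(block_no, 0) + 1
        meta = {"block": block_no, "version": versions[block_no]}
        for n in nodes:                                    # 2f+1 metadata copies
            send_metadata(n, meta)
        for n in preferred_nodes(block_no, nodes, f + 1):  # only f+1 data copies
            send_data(n, block_no, payload)

    def read_block(block_no, nodes, f, ask):
        # One replica suffices; its metadata tells it whether its copy is current.
        return ask(preferred_nodes(block_no, nodes, f + 1)[0], block_no)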

Load-balanced data distribution. Divide the virtual disk into multiple slices and evenly distribute the slices to different preferred nodes; each node serves as preferred storage for some slices and reserve storage for others. Gnothi block drivers on the clients reach the storage nodes over the LAN.
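
An illustrative slice-placement routine in the spirit of the slide (the round-robin policy and the f+1 preferred replicas per slice are assumptions, not the documented algorithm):

    def place_slices(num_slices, nodes, f):
        placement = {}
        for s in range(num_slices):
            start = s % len(nodes)
            preferred = [nodes[(start + i) % len(nodes)] for i in range(f + 1)]
            reserve = [n for n in nodes if n not in preferred]
            placement[s] = {"preferred": preferred, "reserve": reserve}
        return placement

    # Example with 3 nodes and f = 1:
    # slice 0 -> preferred [n1, n2], reserve [n3]; slice 1 -> [n2, n3], [n1]; ...
    print(place_slices(3, ["n1", "n2", "n3"], 1))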

How to continue processing requests during failures? Writes do not wait for the data or metadata transfer to the failed node, and reads still work because metadata is replicated 2f+1 times; the metadata allows a node to process requests correctly.

Catch-up problem in recovery. Can a recovering node catch up? Recovery speed vs. execution speed: traditional systems have the catch-up problem.

How to recover the failed node efficiently? Separate metadata and data recovery. Phase 1: metadata recovery, which is fast. Phase 2: data recovery, which is slow and runs in the background.
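
A two-phase recovery sketch matching the description above; node is assumed to look like the Replica sketch earlier, and fetch_missed_metadata, fetch_block, and serve_requests are hypothetical stand-ins for the real protocol:

    def recover(node, peers, fetch_missed_metadata, fetch_block, serve_requests):
        # Phase 1: metadata recovery -- small, so it is fast. Afterwards the
        # node knows exactly which blocks it missed while it was down.
        stale_blocks = fetch_missed_metadata(peers)
        for b in stale_blocks:
            node.current[b] = False
        serve_requests(node)                 # clients are no longer blocked

        # Phase 2: data recovery -- large, so it runs slowly in the background.
        for b in stale_blocks:
            if not node.current[b]:          # skip blocks overwritten since
                node.data[b] = fetch_block(peers, b)
                node.current[b] = True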

Rest of Gnothi: Why is the trade-off challenging? How does Gnothi resolve the trade-off? How well does Gnothi perform?

Evaluation. Throughput: compare to a Gaios-like system G’ (Bolosky et al., NSDI 2011); sequential/random read/write; f=1 (Gnothi-3, G’-3) and f=2 (Gnothi-5, G’-5); block sizes 4K, 64K, and 1M. Failure recovery: compare Gnothi to G’ and Cheap Paxos; how long does recovery take, and what is the client throughput during recovery?

Gnothi achieves higher throughput. Gnothi can achieve 40%-64% more write throughput and scalable read throughput.

Higher throughput during recovery. Gnothi does not block long on failures and achieves 100%-200% more throughput during recovery; Cheap Paxos blocks for its data copy, while Gnothi completes recovery at almost the same time without blocking.

Gnothi can always catch up. Recovery speed is tunable, and in Gnothi the recovering node can always catch up with the others, whereas in G’ it cannot.

Gnothi conclusion: separate data and metadata. Replication: improves efficiency and ensures availability during failures. Recovery: ensures catch-up.

Outline Gnothi: Efficient and Available Storage Replication – Small scale; tolerate crash faults and timing errors Salus: Robust and Scalable Block Store – Large scale; tolerate arbitrary failures Exalt: Evaluate large-scale storage systems

Problem: not enough machines. In practice: WAS at Microsoft holds 60PB; HDFS at Facebook runs on 4000 servers; … In research: Salus used 100 servers, COPS 300 servers, and Spanner 200 servers. Research should go beyond practice.

Public testbeds: Utah Emulab has 588 machines; CMU Emulab has 1024 machines; TACC (Texas Advanced Computing Center) has 6400 machines but not enough storage; Amazon EC2 cost $1400 for our Salus experiment (108 servers).

Solution 1: Extrapolation. Measure with a small cluster and predict the bottleneck, assuming resource consumption grows linearly with scale. Example: if at 100 nodes CPU utilization is 10% and network utilization is 5%, extrapolation says the system can scale to 1,000 nodes.
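
The extrapolation arithmetic behind that example, as a small sketch (the 10%/5% figures are the slide's example numbers):

    def predicted_max_nodes(measured_nodes, utilizations):
        # The most heavily used resource caps the predicted scale,
        # assuming utilization grows linearly with the number of nodes.
        bottleneck = max(utilizations.values())
        return int(measured_nodes / bottleneck)

    print(predicted_max_nodes(100, {"cpu": 0.10, "network": 0.05}))  # -> 1000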

Solution 1: Extrapolation. Measure with a small cluster and predict the bottleneck. Problem: the linearity assumption may not be true, so the prediction can be wrong.

Solution 2: Stubs. Build stub components to simulate real components. Problem: a stub component can be as complex as the original one.

Solution 3: Simulation

Exalt: evaluate 10,000 nodes on 100 machines. Run real code while using fewer resources. Seems impossible? In general, yes; but for storage systems with big data, we can achieve it.

Key insight: I/O is the bottleneck, but the content of the data does not matter. Solution: choose a highly compressible data pattern and build emulated I/O devices that compress the data (e.g., an emulated network compresses a million zeros on one end and decompresses them on the other).
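
A toy emulated device illustrating the insight (a run-length scheme for all-zero payloads; Exalt's actual emulation and on-disk format differ):

    class EmulatedDisk:
        def __init__(self):
            self.store = {}   # offset -> ("zeros", length) or raw bytes

        def write(self, offset, payload):
            if payload.count(0) == len(payload):
                self.store[offset] = ("zeros", len(payload))  # O(1) space
            else:
                self.store[offset] = payload                  # keep other bytes verbatim

        def read(self, offset):
            entry = self.store[offset]
            if isinstance(entry, tuple) and entry[0] == "zeros":
                return b"\x00" * entry[1]
            return entry

    disk = EmulatedDisk()
    disk.write(0, b"\x00" * 1_000_000)   # a million zeros stored as a single count
    assert len(disk.read(0)) == 1_000_000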

Challenge: the system may add metadata and may split data (possibly nondeterministically). Existing approaches are either inaccurate or inefficient on such mixed patterns.

Goals: do not lose metadata, achieve a high compression ratio, remain computationally efficient, and work with the mixed pattern.

Existing approaches: David (FAST '11) discards file content, but that loses metadata since it is mixed with the data; gzip and similar compressors are not efficient; writing all zeroes and scanning for zeros is still not efficient enough.

Solution: Tardis. Key: we cannot choose the metadata, but we can choose the data – make the data distinguishable from the metadata. Each data chunk carries a magic sequence of bytes that does not appear in metadata, followed by an integer giving the number of data bytes left.
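
A sketch of such a data pattern (the magic value and the exact layout are assumptions for illustration): a marker that should never appear in metadata, a "bytes left" counter, then filler bytes.

    import struct

    MAGIC = b"\xd7\x1b\x42\x9e\x5a\xc3\x8f\x11"   # hypothetical 8-byte magic

    def make_data(length):
        # Requires length >= 16: 8 bytes of magic plus an 8-byte counter.
        n_left = length - len(MAGIC) - 8
        return MAGIC + struct.pack(">Q", n_left) + b"\x00" * n_left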

Tardis compression: search for the magic sequence, retrieve the number of bytes left (Nleft), jump Nleft bytes, and search for the magic sequence again.
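
The scan loop corresponding to these steps, as a sketch that assumes the make_data layout from the previous example (well-formed input; the real Tardis must also handle the corner cases on the next slide):

    import struct

    MAGIC = b"\xd7\x1b\x42\x9e\x5a\xc3\x8f\x11"   # same hypothetical magic as above

    def tardis_compress(buf):
        out, i = [], 0
        while i < len(buf):
            j = buf.find(MAGIC, i)
            if j < 0:
                out.append(("meta", buf[i:]))     # trailing metadata, kept verbatim
                break
            if j > i:
                out.append(("meta", buf[i:j]))    # metadata before the data run
            n_left = struct.unpack(">Q", buf[j + 8: j + 16])[0]
            run_len = len(MAGIC) + 8 + n_left
            out.append(("data", run_len))         # the whole data run becomes a count
            i = j + run_len                       # jump Nleft bytes, then rescan
        return out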

Problems. How to find a magic sequence? A randomly chosen 8-byte sequence works for HDFS; alternatively, run the system, record a trace, and analyze it. What if the system inserts metadata into the data? After a jump, check that the destination matches the jumped bytes; if not, binary search until a match is found.

Use Exalt. Emulated devices have inaccurate performance. If one or a few nodes are the bottleneck, run those nodes in real mode and run the other nodes in emulation mode.

Use Exalt. What if the behavior depends on a large number of nodes, e.g., 99th-percentile latency or parallel recovery? Then we need to model the behavior of the emulated devices, deriving disk/network latency and energy consumption from the number of bytes.
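
A hypothetical device model of the kind described: derive latency from the byte count the emulated device observed (the constants are invented, not measured):

    SEEK_MS = 5.0            # assumed per-request positioning cost
    DISK_MB_PER_S = 100.0    # assumed sequential disk bandwidth

    def modeled_disk_latency_ms(num_bytes):
        return SEEK_MS + (num_bytes / (DISK_MB_PER_S * 1e6)) * 1000.0

    print(modeled_disk_latency_ms(4096))        # a 4K write
    print(modeled_disk_latency_ms(64 * 1024))   # a 64K write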

Implementation: Bytecode Instrumentation (BCI). Emulated devices: disk (transparent), network (transparent), and memory (requires code modification).

Preliminary results on HDFS

Proposed work: apply “separating data and metadata” to active storage in Salus; complete Exalt – incorporate latency modeling, apply Exalt to more applications, and complete the Tardis implementation; multiple-RSM communication – join the project led by Manos (not part of my thesis).

Publications "Robustness in the Salus scalable block store". Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin, in NSDI "All about Eve: Execute-Verify Replication for Multi-Core Servers". M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin, in OSDI "Gnothi: Separating Data and Metadata for Efficient and Available Storage Replication". Y. Wang, L. Alvisi, and M. Dahlin, in USENIX ATC "UpRight Cluster Services". A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, T. Riche, in SOSP 2009.

Backup slides

Cost of Gnothi. Higher write latency: in a LAN the major latency comes from the disk; Gnothi writes metadata and data together to disk, and Rethink-the-Sync-style writes should also help. Loss of generality: Gnothi is designed only for block storage.

How does Gnothi compare to GFS/HDFS/xFS/…? Those systems have a metadata server and multiple data servers. Gnothi updates metadata on every write and checks metadata on every read; they do that at a coarse granularity. Advantages: high scalability. Disadvantages: weaker consistency guarantees, an append-only interface, worse availability, …

Efficient recovery. Can a recovering node catch up? Recovery speed vs. execution speed: traditional systems have the catch-up problem.

Is timing error a real threat? It can cause data inconsistency, for reasons such as network partitions, server overloading, and so on. It is a real concern in practical systems; see HBASE-2238: “Because HDFS and ZK are partitioned (in the sense that there's no communication between them) and there may be an unknown delay between acquiring the lock and performing the operation on HDFS you have no way of knowing that you still own the lock, like you say.”

Interface & Models Disk interface – A fixed number of fixed-size blocks – A request can read/write a single block – Linearizable reads and writes Asynchronous model: no maximum delay – Omission failure only – Always safe – Live when the network is synchronous 62

Architecture: fully replicated metadata and partially replicated data, with load balancing between preferred storage and reserve storage. (The figure maps slices of the virtual disk to each node's metadata, preferred storage, and reserve storage.)

Data can be stored outside its preferred replicas. For example, after a network problem, replica 0 may not have the current data while only replica 2 does, and the metadata records this.

Gnothi: available and efficient. Apps use Gnothi block drivers, which talk to Gnothi storage servers over the LAN. Availability is the same as asynchronous replication: safe regardless of timing errors, and an aggressive timeout can be used.

Gnothi: available and efficient. Efficiency: storage/bandwidth efficiency by writing to f+1 replicas, and read efficiency by reading from 1 replica.

Previous work cannot achieve both. Synchronous primary backup (Remus, Hypervisor, HBase, …) uses conservative timeouts and sacrifices availability. Preferred quorum (Cheap Paxos, ZZ, …) uses cold backups and also sacrifices availability. Gaios scales reads but not storage/bandwidth. Asynchronous replication (Paxos, …) uses 2f+1 replicas and sacrifices efficiency. Gnothi, by separating data and metadata, achieves both efficiency and availability.

Resolving a long-standing trade-off. Efficiency – write to f+1 replicas and read from 1 replica. Availability – aggressive timeout for failure detection. Consistency – reads always return the data of the latest write. (Compared across Synchronous Primary Backup, Asynchronous Replication, and Gnothi, this talk.)

Catch-up problem in recovery. Recovery speed vs. execution speed: traditional systems have the catch-up problem. Traditional approaches fetch the missing data before processing new requests, so a recovering node cannot catch up and has to block or throttle.

Separate metadata and data recovery. Metadata recovery is fast; data recovery is slow and runs in the background. The recovering node can process new requests after metadata recovery.

Separate metadata and data recovery. Once data recovery completes in the background, the reserve storage can be released.

Gnothi ensures catch-up. Gnothi fetches only the missing metadata before processing new requests, whereas traditional approaches fetch the missing data; the recovering node is never left behind after metadata recovery.

How does Gnothi work? How to perform writes and reads efficiently when no failures occur? How to continue processing requests during failures? How to recover the failed node efficiently?