Data-Centric Reconfiguration with Network-Attached Disks
Alex Shraer (Technion)
Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR) and I. Keidar (Technion)

Preview
– The setting: data-centric replicated storage
  – Simple network-attached storage nodes
  – Storage nodes can be added and removed dynamically
– Our contributions:
  1. The first distributed reconfigurable R/W storage
  2. Asynchronous vs. consensus-based reconfiguration

Enterprise Storage Systems
– Highly reliable, customized hardware
– Controllers and I/O ports may become a bottleneck
– Expensive
– Usually not extensible: different solutions for different scales
  – Example (HP): high end – XP (1152 disks); mid range – EVA (324 disks)

Alternative – Distributed Storage
– Made up of many storage nodes
– Unreliable, cheap hardware: failures are the norm, not an exception
– Challenges:
  – Achieving reliability and consistency
  – Supporting reconfiguration

Distributed Storage Architecture
– Unpredictable network delays (asynchrony)
– [Figure: dynamic, fault-prone storage clients send read and write requests over a LAN/WAN to fault-prone storage nodes (cloud storage)]

A Case for Data-Centric Replication
– Client-side code runs the replication logic (a "not-so-thin" client)
  – Communicates with multiple storage nodes (see the sketch below)
– Simple storage nodes (servers), i.e., "thin" storage nodes
  – Can be network-attached disks: not necessarily PCs with disks; do not run application-specific code; less fault-prone components
  – Simply respond to client requests: high throughput
  – Do not communicate with each other (if storage nodes communicate, their failures are likely to be correlated!)
  – Oblivious to where other replicas of each object are stored: scalable, and the same storage node can be used for many replication sets
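To make the division of labor concrete, here is a minimal Python sketch; the class and method names are our own for illustration, not DynaDisk's API. The storage node merely keeps a timestamped value per object and answers individual requests, while the client runs ABD-style read/write logic against a majority of the nodes in the current configuration.

    class StorageNode:
        """A 'thin' storage node: stores a (timestamp, value) pair per object
        and answers individual requests; it never talks to other nodes and
        runs no application-specific code."""
        def __init__(self):
            self.store = {}                    # object id -> (timestamp, value)

        def read(self, obj):
            return self.store.get(obj, (0, None))

        def write(self, obj, timestamp, value):
            if timestamp > self.store.get(obj, (0, None))[0]:
                self.store[obj] = (timestamp, value)   # keep only the newest value

    class AbdClient:
        """A 'not-so-thin' client running the replication logic. For brevity it
        always uses the first majority of the node list; a real client would
        use whichever majority responds."""
        def __init__(self, nodes):
            self.majority = nodes[: len(nodes) // 2 + 1]

        def write(self, obj, value):
            ts = max(n.read(obj)[0] for n in self.majority)  # phase 1: learn max timestamp
            for n in self.majority:                          # phase 2: store newer value
                n.write(obj, ts + 1, value)

        def read(self, obj):
            ts, value = max((n.read(obj) for n in self.majority), key=lambda r: r[0])
            for n in self.majority:                          # write back what was read
                n.write(obj, ts, value)
            return value

    nodes = [StorageNode() for _ in range(3)]
    client = AbdClient(nodes)
    client.write("x", "Spain")
    assert client.read("x") == "Spain"

The write-back in read is what preserves atomic semantics when a later reader happens to contact a different majority.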

Real Systems Are Dynamic
– The challenge: maintain consistency, reliability, and availability while the set of storage nodes changes
– [Figure: storage nodes A–E behind a LAN/WAN; reconfig {–A, –B} removes nodes A and B, and reconfig {–C, +F, …, +I} removes C and adds F, G, H, I]

Pitfall of Naïve Reconfiguration
– [Figure: the system starts in configuration {A, B, C, D}; a delayed reconfig {+E} and a concurrent reconfig {–D} leave different clients believing in different configurations, {A, B, C, D, E} versus {A, B, C}]

Pitfall of Naïve Reconfiguration (continued)
– [Figure: one client writes x = "Spain" (timestamp 2) to a majority of {A, B, C, D, E}; another client reads x from a majority of {A, B, C} that the write did not reach, so the read returns the stale value "Italy" (timestamp 1): split brain!]
– The problem: a majority of {A, B, C, D, E} and a majority of {A, B, C} need not intersect, so the read can miss the latest write (see the sketch below)
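The few lines of Python below (all names are ours, purely for illustration) spell out why the two quorums in the figure can be disjoint, which is exactly what allows the split brain.

    old_config = {"A", "B", "C", "D", "E"}   # the writer's view of the configuration
    new_config = {"A", "B", "C"}             # a different client's view of the configuration

    def is_majority(quorum, config):
        """A quorum is a majority of config if it is a subset containing more
        than half of config's members."""
        return quorum <= config and len(quorum) > len(config) / 2

    write_quorum = {"C", "D", "E"}   # acknowledged the write of "Spain"
    read_quorum = {"A", "B"}         # answered the read with the old value "Italy"

    assert is_majority(write_quorum, old_config)
    assert is_majority(read_quorum, new_config)
    assert write_quorum.isdisjoint(read_quorum)   # no overlap: the read misses the write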

Reconfiguration Option 1: Centralized
– Can be automatic, e.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
– Downtime: most solutions stop R/W while reconfiguring
  – [Example maintenance notice: "Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually yours, Moshe Barak"]
– Single point of failure: what if the manager crashes while changing the system?

Reconfiguration Option 2: Distributed Agreement
– Servers agree on the next configuration (previous solutions are not data-centric)
– No downtime
– In theory, might never terminate [FLP85]; in practice, we have partial synchrony, so it usually works

Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, Shraer, PODC 2009]
– Distributed and completely asynchronous
– No downtime
– Always terminates
– Not data-centric

In This Work: DynaDisk, a Dynamic Data-Centric R/W Storage System
1. First distributed data-centric solution
   – No downtime
2. Tunable reconfiguration method
   – Modular design: coordination is separate from data
   – Makes it easy to set and compare the coordination method
   – Consensus-based vs. asynchronous reconfiguration
3. Many shared objects
   – Running a protocol instance per object is too costly
   – Transferring all state at once might be infeasible
   – Our solution: incremental state transfer
4. Built with an external (weak) location service
   – We formally state the requirements on such a service

Location Service
– Used in practice, ignored in theory
– We formalize the weak external service as an oracle (interface sketch below):
  – oracle.query() returns some "legal" configuration
  – If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
– This oracle alone is not enough to solve reconfiguration
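As a rough illustration (the class and method names are ours, not part of DynaDisk), the oracle can be thought of as the interface sketched below; note how weak the guarantee is, which is why the oracle alone cannot replace the reconfiguration protocol.

    class LocationService:
        """Weak location-service oracle. query() may return any 'legal'
        configuration, i.e., one that was actually installed at some point;
        only if reconfigurations stop and query() keeps being invoked must it
        eventually return the last system configuration."""
        def __init__(self, initial_config):
            self._known = [frozenset(initial_config)]   # legal configs seen so far

        def report(self, config):
            # Installed configurations are made known to the service out of band.
            self._known.append(frozenset(config))

        def query(self):
            # Returning the most recently reported configuration satisfies the
            # spec, but so would returning any earlier legal one.
            return self._known[-1]

    svc = LocationService({"A", "B", "C"})
    svc.report({"A", "B", "C", "D"})
    assert svc.query() == frozenset({"A", "B", "C", "D"})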

The Coordination Module in DynaDisk
– [Figure: storage devices A, B, C in configuration conf = {+A, +B, +C}; each device stores the R/W objects x, y, z and a "next config" slot, initially ⊥]
– Distributed R/W objects, updated similarly to ABD
– Distributed "weak snapshot" object, with API (interface sketch below):
  – update(set of changes) → OK
  – scan() → set of updates
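For reference, the two-operation API on the slide can be written down as a small abstract interface; the Python below is only our transcription, not DynaDisk's code.

    from abc import ABC, abstractmethod

    class WeakSnapshot(ABC):
        """One weak snapshot object is associated with each configuration and
        is stored across that configuration's devices."""

        @abstractmethod
        def update(self, changes):
            """Propose a set of changes (e.g., {'+D', '-C'}) as a next step
            from this configuration; returns OK once the update is installed."""

        @abstractmethod
        def scan(self):
            """Return a set of installed updates (each a set of changes).
            Key guarantee: every two non-empty sets returned by scan() intersect."""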

Coordination with Consensus
– [Figure: two clients concurrently run reconfig({–C}) and reconfig({+D}); a consensus instance picks one of the proposals, here +D, and it is stored as the "next config" at devices A, B, C]
– update: propose the set of changes to the configuration's consensus instance
– scan: read and write back the next config from a majority; every scan returns +D or ⊥
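A consensus-based weak snapshot can then be sketched as follows. SingleShotConsensus is only an in-memory stand-in for a real consensus object such as Active Disk Paxos; propose() returning the single decided value and read() returning the decision (or None) are assumptions of this sketch, not a real API.

    class SingleShotConsensus:
        """Stand-in for a real consensus instance: the first proposal wins and
        every caller gets the same decided value."""
        def __init__(self):
            self._decision = None

        def propose(self, value):
            if self._decision is None:
                self._decision = value
            return self._decision

        def read(self):
            return self._decision            # the decision, or None if none yet

    class ConsensusWeakSnapshot:
        """One consensus instance per configuration decides the unique
        next-config update."""
        def __init__(self, consensus):
            self.consensus = consensus

        def update(self, changes):
            self.consensus.propose(frozenset(changes))

        def scan(self):
            decided = self.consensus.read()
            return set() if decided is None else {decided}

    ws = ConsensusWeakSnapshot(SingleShotConsensus())
    ws.update({"+D"})
    ws.update({"-C"})                        # loses: +D was decided first
    assert ws.scan() == {frozenset({"+D"})}

Because every non-empty scan returns the same single decided update, the non-empty intersection property holds trivially; in this sense consensus is stronger than what a weak snapshot actually requires.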

Weak Snapshot – Weaker than Consensus
– No need to agree on the next configuration, as long as each process obtains a set of possible next configurations and all such sets intersect
  – The intersection allows clients to converge and again use a single configuration
– Non-empty intersection property of weak snapshot (checked in the sketch below):
  – Every two non-empty sets returned by scan() intersect
  – Example: with concurrent reconfig({+D}) and reconfig({–C}), Client 1's scan may return {+D} while Client 2's scan returns {+D, –C}; both contain +D, so they intersect. If both scans return the same set, e.g., {+D}, the outcome looks just like consensus
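The property itself fits in a few lines of Python (a hypothetical helper of ours, shown only to pin the definition down):

    def satisfies_intersection(scans):
        """Every two non-empty sets returned by scan() must intersect."""
        non_empty = [s for s in scans if s]
        return all(a & b for a in non_empty for b in non_empty)

    # Allowed: both non-empty scans contain the common update +D.
    assert satisfies_intersection([{"+D"}, {"+D", "-C"}, set()])
    # Not allowed: {+D} and {-C} are both non-empty but share no update.
    assert not satisfies_intersection([{"+D"}, {"-C"}])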

Coordination without Consensus
– [Figure: two clients concurrently run reconfig({–C}) and reconfig({+D}); each device's "next config" is a vector of slots, and a client installs its proposal with compare-and-swap, e.g., CAS({–C}, ⊥, 0); if another proposal (+D) already occupies slot 0, the client moves on, e.g., CAS({–C}, ⊥, 1), or writes its proposal where the slot is still free, e.g., WRITE({–C}, 0) → OK]
– update: install the proposed changes at a majority of devices
– scan: read and write back the proposals from a majority (twice)
– A sketch of this consensus-free implementation follows below
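The single-process Python sketch below captures the flavor of the asynchronous implementation. The in-memory Device class, the slot layout, and the retry policy are simplifications of ours, and the concurrency and failure handling that make the intersection guarantee precise are elided.

    class Device:
        """In-memory stand-in for a network-attached disk exposing a vector of
        'next config' slots with compare-and-swap."""
        def __init__(self, num_slots):
            self.slots = [None] * num_slots

        def cas(self, index, expected, new):
            """Install 'new' in slot 'index' if it still holds 'expected';
            return the slot's content either way."""
            if self.slots[index] == expected:
                self.slots[index] = new
            return self.slots[index]

        def read_slots(self):
            return list(self.slots)

    class AsyncWeakSnapshot:
        """Consensus-free weak snapshot over a majority of the devices."""
        def __init__(self, devices):
            self.majority = devices[: len(devices) // 2 + 1]

        def update(self, changes):
            proposal = frozenset(changes)
            for dev in self.majority:
                self._install(dev, proposal)

        def _install(self, dev, proposal):
            # Try slots in order until the proposal is stored (or already there).
            for i in range(len(dev.slots)):
                if dev.cas(i, None, proposal) == proposal:
                    return

        def _collect(self):
            seen = set()
            for dev in self.majority:
                seen.update(s for s in dev.read_slots() if s is not None)
            return seen

        def scan(self):
            # Read proposals from a majority, write them back, and read again,
            # mirroring the "read & write-back (twice)" on the slide.
            for proposal in self._collect():
                for dev in self.majority:
                    self._install(dev, proposal)
            return self._collect()

    devices = [Device(num_slots=2) for _ in range(3)]
    ws = AsyncWeakSnapshot(devices)
    ws.update({"+D"})
    ws.update({"-C"})
    assert frozenset({"+D"}) in ws.scan() and frozenset({"-C"}) in ws.scan()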

Tracking Evolving Configurations
– With consensus: clients agree on the next configuration, so the configurations form a chain
– Without consensus: usually a chain, sometimes a DAG
– [Figure: starting from {A, B, C}, concurrent reconfigurations create branches, +D leading to {A, B, C, D} and –C leading to {A, B}; the inconsistent updates are found and merged into {A, B, D}. With a weak snapshot, one scan() returns {+D} and another returns {+D, –C}; since all non-empty scans intersect, clients discover each other's branches and converge (traversal sketch below)]
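As a much-simplified illustration of the traversal (the helper names and the scan callback are hypothetical), a client can walk the DAG by scanning the weak snapshot of every configuration it discovers and then applying the union of all the changes it has seen:

    def apply_changes(config, changes):
        """Apply '+X' / '-X' changes to a configuration given as a frozenset."""
        members = set(config)
        for change in changes:
            if change.startswith("+"):
                members.add(change[1:])
            else:
                members.discard(change[1:])
        return frozenset(members)

    def converge(initial_config, scan):
        """scan(cfg) is assumed to return the set of updates (each a frozenset
        of changes) installed in cfg's weak snapshot."""
        to_visit, seen, all_changes = [initial_config], set(), set()
        while to_visit:
            cfg = to_visit.pop()
            if cfg in seen:
                continue
            seen.add(cfg)
            for update in scan(cfg):
                all_changes |= update
                to_visit.append(apply_changes(cfg, update))   # follow each branch
        return apply_changes(initial_config, all_changes)

    # Example matching the slide: from {A, B, C}, one scan finds {+D} and
    # another finds {+D, -C}; the branches are merged into {A, B, D}.
    installed = {frozenset({"A", "B", "C"}): {frozenset({"+D"}),
                                              frozenset({"+D", "-C"})}}
    result = converge(frozenset({"A", "B", "C"}),
                      lambda cfg: installed.get(cfg, set()))
    assert result == frozenset({"A", "B", "D"})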

Consensus-Based vs. Asynchronous Coordination
– Two implementations of weak snapshots:
  – Asynchronous
  – Partially synchronous (consensus-based): Active Disk Paxos [Chockler, Malkhi, 2005] with exponential backoff for leader election
– Unlike asynchronous coordination, consensus-based coordination might not terminate [FLP85]
– Storage overhead (per storage device and configuration):
  – Asynchronous: a vector of updates; vector size ≤ min(#reconfigs, #members in config)
  – Consensus-based: 4 integers and the chosen update

Strong Progress Guarantees Are Not for Free
– When many reconfigurations execute simultaneously:
  – Asynchronous (no consensus): significant negative effect on R/W latency
  – Consensus-based: slightly better, and much more predictable, reconfiguration latency
– When no reconfigurations are in progress, the two perform the same

Future & Ongoing Work
– Combine asynchronous and partially synchronous coordination
– Consider other weak snapshot implementations, e.g., using randomized consensus
– Use weak snapshots to reconfigure other services, not just R/W storage

Summary
– DynaDisk: dynamic data-centric R/W storage
  – First decentralized solution
  – No downtime
  – Supports many objects and provides incremental reconfiguration
  – Uses one coordination object per configuration (not per object)
  – Tunable reconfiguration method: we implemented asynchronous and consensus-based coordination; many other implementations of weak snapshots are possible
– Asynchronous coordination in practice:
  – Works in more circumstances → more robust
  – But at a cost: it significantly affects ongoing R/W operations