Decoupled Storage: “Free the Replicas!”

Decoupled Storage: “Free the Replicas!”
Andy Huang and Armando Fox, Stanford University

What is decoupled storage (DeStor)?
- Goal: an application-level persistent storage system for Internet services
  - Good recovery behavior
  - Predictable performance
- Related projects
  - A decoupled version of DDS (Gribble)
  - Federated Array of Bricks (HP Labs), but at the application level
  - Session State Server (Ling), but for persistent state

Outline
- Dangers of coupling
- Techniques for decoupling
- Consequences

ROWA – coupling and recovery don’t mix
- Read One (i.e., any)
  - All copies must be consistent
  - Availability coupling: data is locked during recovery to bring a replica up to date
- Write All
  - Writes proceed at the rate of the slowest replica
  - Performance coupling: the system can grind to a halt if one replica degrades
  - Possible causes of degradation: cache warming and garbage collection

Decoupled ROWA – allow replicas to say “No”
- Write All (but replicas can say “No”)
  - Removes performance coupling: a write can complete without waiting for a degraded replica
  - Removes availability coupling: allowing stale values eliminates the need to lock data during recovery
  - Issue: a read may return a stale value
- Read One (but read all timestamps)
  - Replicas can also say “No” to a read_timestamp request
  - Use quorums to make sure enough replicas say “Yes”

Quorums – use up-to-date information
- Perform reads and writes on a majority of the replicas
- Use timestamps to determine the correct value of a read
- Performance coupling remains
  - Problem: requests are distributed using static information
  - Consequence: one degraded node can slow down over 50% of writes
- Load-balanced quorums: use current load information to select quorum participants (see the sketch below)
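
A minimal sketch of load-balanced quorum selection, assuming each replica periodically reports a load estimate; the Replica class, its load field, and the example load values are hypothetical:

```python
class Replica:
    """Hypothetical replica handle carrying a recently reported load estimate."""
    def __init__(self, name, load):
        self.name = name
        self.load = load  # e.g., queue length or recent latency; smaller is better


def majority(n):
    """Smallest majority quorum size for n replicas."""
    return n // 2 + 1


def pick_quorum(replicas):
    """Choose a majority quorum from the currently least-loaded replicas,
    so a degraded node is simply left out of most quorums."""
    ranked = sorted(replicas, key=lambda r: r.load)
    return ranked[:majority(len(replicas))]


replicas = [Replica("R1", 0.2), Replica("R2", 0.9),   # R2 is warming its cache
            Replica("R3", 0.1), Replica("R4", 0.3), Replica("R5", 0.25)]
print([r.name for r in pick_quorum(replicas)])        # ['R3', 'R1', 'R5'] -- degraded R2 avoided
```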

DeStor – two ways to look at it
- Decoupled ROWA
  - “Write all” is best-effort, but write to at least a majority
  - Read a majority of timestamps to check for staleness
- Load-balanced quorums (with a read optimization)
  - Use dynamic load information
  - Read one value and a majority of timestamps

DeStor write
- Issue write(key, val) to all N replicas (see the write-path sketch below)
- Wait for a majority to acknowledge before returning success
- Otherwise, time out and retry or return failure
[Diagram: client C issues write v.7 to replicas R1–R4 and returns success once a majority have acknowledged]
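
A minimal sketch of this write path, assuming each replica exposes a write(key, val, ts) RPC stub (hypothetical) that returns True on acknowledgement, with a client-generated timestamp as described later:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout


def destor_write(replicas, key, val, client_id, timeout=0.5):
    """Best-effort 'write all': send the write to every replica, but return
    success as soon as a majority acknowledge; slow replicas can lag behind."""
    ts = (time.time(), client_id)             # client-generated physical timestamp
    needed = len(replicas) // 2 + 1
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.write, key, val, ts) for r in replicas]
    acks = 0
    try:
        for fut in as_completed(futures, timeout=timeout):
            try:
                if fut.result():              # replica acknowledged
                    acks += 1
            except Exception:
                continue                      # replica said "No" or failed
            if acks >= needed:
                return True                   # majority reached
    except FuturesTimeout:
        pass                                  # not enough replicas answered in time
    finally:
        pool.shutdown(wait=False)             # don't block on stragglers
    return False                              # caller may retry or report failure
```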

DeStor read
- Issue {v, tv} = read(key) to one random replica (see the read-path sketch below)
- Issue get_timestamp(key) to all N replicas
- Find the most recent timestamp t* in T = {t1, t2, …}
- If tv = t*, return v
- Otherwise, issue read(key) to a replica whose tn = t*
[Diagram: client C reads a value from one replica and timestamps from R1–R4, returning the value once its timestamp matches the most recent one]
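
A minimal sketch of this read path; the read(key) and get_timestamp(key) RPC stubs are hypothetical, and timestamps are assumed to be totally ordered (e.g., (physical time, client id) pairs):

```python
import random


def destor_read(replicas, key):
    """Read one value, but check its freshness against a majority of timestamps."""
    needed = len(replicas) // 2 + 1

    # 1. Fetch the value and its timestamp from one randomly chosen replica.
    val, val_ts = random.choice(replicas).read(key)

    # 2. Collect timestamps until a majority have answered "Yes".
    stamps = []
    for r in replicas:
        try:
            stamps.append((r.get_timestamp(key), r))
        except Exception:
            continue                           # replica said "No" or is unreachable
        if len(stamps) >= needed:
            break
    if len(stamps) < needed:
        raise RuntimeError("no majority of timestamps available")

    # 3. If our value carries the most recent timestamp seen, it is safe to return.
    newest_ts, newest_replica = max(stamps, key=lambda s: s[0])
    if val_ts >= newest_ts:
        return val

    # 4. Otherwise fetch the value from a replica known to hold the newest version.
    fresh_val, _ = newest_replica.read(key)
    return fresh_val
```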

Decoupling further – unlock the data
- Two-phase commit (2PC) ensures atomicity among replicas
  - Couples replicas between the two phases
  - Locking complicates the implementation and recovery
- 2PC not needed for DeStor? The design relies on:
  - Client-generated physical timestamps (see the sketch below)
  - API: single-operation transactions with no partial updates
  - Assumption: clients operate independently
[Diagrams: clients C1 and C2 issue conflicting writes (x=1, x=2) to replicas R1–R4; independent reads and writes of v.6/v.7 through DeStor]
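
A minimal sketch of the client-generated physical timestamps this relies on; breaking ties with a client id is my assumption, since the slides only state that clocks are loosely synchronized and clients act independently:

```python
import time


def make_timestamp(client_id):
    """Client-generated physical timestamp: (wall-clock seconds, client id).
    The client id breaks ties between clients writing in the same instant,
    so every version of a key is totally ordered without any coordination."""
    return (time.time(), client_id)


# Two independent clients write the same key; whichever timestamp compares
# larger wins at every replica, so replicas converge without locks or 2PC.
ts_a = make_timestamp(client_id=1)
ts_b = make_timestamp(client_id=2)
winner = max(ts_a, ts_b)        # lexicographic: physical time first, then client id
```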

Client failure – what can happen without locks
- Issue: fewer than a majority are written
  - A read hitting R2 and R3 returns v.6
  - A read hitting R1 plus R2 or R3 returns v.7
- Serializability
  - Once v.7 has been read, make sure it becomes the majority (a read-repair sketch follows below)
  - Idea: the write of v.7 “didn’t happen” until it was read
[Diagram: client C1 crashes after writing v.7 to only R1, leaving R2–R4 at v.6; a later reader propagates v.7]
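
One way to realize “once v.7 is read, make sure it is the majority” is a read-repair write-back. This sketch is my interpretation rather than the authors’ stated mechanism, and it reuses the hypothetical RPC stubs from the earlier sketches:

```python
def read_with_repair(replicas, key):
    """Return the freshest value, and before doing so make sure a majority of
    replicas hold it. An orphaned write left by a crashed client therefore
    'happens' only at the moment it is first observed."""
    needed = len(replicas) // 2 + 1
    val, ts = destor_read_with_ts(replicas, key)   # hypothetical: destor_read that also returns the winning timestamp
    acked = 0
    for r in replicas:
        try:
            # Count replicas already at this version; write it to the rest.
            if r.get_timestamp(key) == ts or r.write(key, val, ts):
                acked += 1
        except Exception:
            continue                               # replica said "No" or is unreachable
        if acked >= needed:
            return val
    raise RuntimeError("could not repair the value onto a majority of replicas")
```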

Timestamps – loose synchronization is sufficient
- Unsynchronized clocks
  - Issue: a client’s writes are “lost” because other writers’ timestamps are always more recent
  - Why that’s okay: clients are independent, so they can’t differentiate a “lost write” from an overwritten value
- Caveat: a user is often behind the client requests
  - The user sees inter-request causality
  - NTP synchronizes clocks to within milliseconds, which is sufficient for human-speed interactions

Consequence – behavior is more restricted
- Good recovery behavior
  - Data remains available throughout a crash and recovery
  - Performance degradation during cache warming doesn’t affect other replicas
- Predictable performance
  - DeStor vs. ROWA: DeStor has better write throughput and latency at the cost of read throughput and latency
  - Key: better degradation characteristics → more predictable performance

Performance: predictable (write throughput, Twrite)
- Definitions
  - T1 = throughput of a single replica
  - D1 = % degradation of one replica
  - D = % system degradation = (−slope / Tmax) · D1 (worked example below)
- ROWA: Tmax = T1, slope = −T1, so D = D1
- DeStor: Tmax = (N/Q)·T1, i.e., T1 ≤ Tmax ≤ 2T1; slope = −T1/Q = −2T1/(N+1), so D = D1/N
[Plot: system throughput T vs. single-replica degradation D1 for N = 3, 5, 7]
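
A small worked example of the degradation formula (my own arithmetic, using a majority quorum Q = (N+1)/2): with N = 5 replicas and one replica degraded by D1 = 50%, ROWA’s write throughput drops by 50% while DeStor’s drops by only 10%.

```python
def majority(n):
    return (n + 1) // 2                       # quorum size Q for N replicas


def system_degradation(scheme, n, d1):
    """System write-throughput degradation D = (-slope / Tmax) * D1,
    using each scheme's slope and Tmax from the slide (T1 cancels out)."""
    t1 = 1.0                                  # single-replica throughput (arbitrary units)
    q = majority(n)
    if scheme == "ROWA":
        t_max, slope = t1, -t1
    elif scheme == "DeStor":
        t_max, slope = (n / q) * t1, -t1 / q
    else:
        raise ValueError(scheme)
    return (-slope / t_max) * d1


print(system_degradation("ROWA", n=5, d1=0.5))    # ~0.5 -> 50% system degradation (D = D1)
print(system_degradation("DeStor", n=5, d1=0.5))  # ~0.1 -> 10% system degradation (D = D1/N)
```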

Performance: slightly degraded (read throughput, Tread)
- ROWA: Tmax = N·T1, slope = −T1, so D = D1/N
- DeStor: depends on the overhead of the read_timestamp request
  - Tmax = N·T1 − (N/Q)·[overhead]
  - slope = −T1 − (T1/Q)·[overhead]
  - D ≈ D1/N
[Plot: read throughput T vs. single-replica degradation D1, ranging between (N−1)·T1 and N·T1]

Research issues – once replicas are free…
- Next step: simulate ROWA and DeStor
  - Measure: read and write throughput/latency
  - Factors: object size, working set size, read-write mix
- Opens up new options for system administration
  - Online repartitioning, scaling, and replica replacement
- Raises new issues for performance optimization
  - When is in-memory replication persistent enough (non-write-through replicas)?

Summary
- An application-level persistent storage system
- Replication scheme
  - Write all, wait for a majority
  - Read any one value, read a majority of timestamps
- Consequences
  - Data availability throughout recovery
  - Predictable performance when replicas degrade