CS5412: HOW DURABLE SHOULD IT BE? Ken Birman 1 CS5412 Spring 2015 (Cloud Computing: Birman) Lecture XV.


Choices, choices… CS5412 Spring 2015 (Cloud Computing: Birman) 2  A system like Isis 2 lets you control message ordering and durability, while Paxos always opts for the strongest guarantees.  With Isis 2, you start with total order (g.OrderedSend) but can then relax the ordering to speed things up if no conflicts (inconsistencies) would arise.

How much ordering does it need? CS5412 Spring 2015 (Cloud Computing: Birman) 3  Example: we have some group managing replicated data, using g.OrderedSend() for updates  But perhaps only one group member is connected to the sensor that generates the updates  With just one source of updates, g.Send() is faster  Isis 2 will discover this simple case automatically, but in more complex situations, the application designer might need to explicitly use g.Send() to be sure.
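As a rough illustration of the single-source case, here is a minimal sketch in C# (the language Isis2 is written in). The group name, message code, handler body, and sensor reading are invented for the example, and the library calls (IsisSystem.Start, Group, Handlers, Join, Send, WaitForever) are used as I recall them from the Isis2 manual, so treat the exact signatures as assumptions:

    using System;
    using Isis;                                    // the Isis2 library

    class SensorRelay
    {
        const int UPDATE = 0;                      // application-chosen message code (illustrative)

        static void Main()
        {
            IsisSystem.Start();
            Group g = new Group("sensor-shard");   // hypothetical group name

            // Every member registers the same handler; each delivery updates the local replica.
            g.Handlers[UPDATE] += (Action<string, double>)delegate(string sensor, double reading)
            {
                Console.WriteLine("apply {0} = {1}", sensor, reading);
            };
            g.Join();

            // Only one member is attached to the sensor, so FIFO order from that single
            // sender is already a total order: g.Send() suffices, and g.OrderedSend()
            // would only add cost here.
            g.Send(UPDATE, "turbine-7-temperature", 98.6);

            IsisSystem.WaitForever();              // keep the member running
        }
    }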

Durability CS5412 Spring 2015 (Cloud Computing: Birman) 4  When a system accepts an update and won’t lose it, we say that the event has become durable  They say the cloud has a permanent memory  Once data enters a cloud system, it is rarely discarded  More common to make lots of copies, index it…  But losing data to a failure is still an issue

Durability in real systems CS5412 Spring 2015 (Cloud Computing: Birman) 5  Database components normally offer durability  Paxos also offers durability: think of it as a database of “messages” saved for replay into services that need consistent state  Systems like Isis 2 focus on consistency for multicast, and for multicast, durability is optional (and costly)

Should Consistency “require” Durability? CS5412 Spring 2015 (Cloud Computing: Birman) 6  The Paxos protocol guarantees durability to the extent that its command lists are durable  Normally we run Paxos with the messages (the “list of commands”) on disk, and hence Paxos can survive any crash  In Isis 2, this is g.SafeSend with the “DiskLogger” active  But doing so slows the protocol down compared to not logging messages so durably
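For comparison, here is a hedged sketch of the durable path in the same style (again C#; the group name and command text are invented, and the configuration call that activates the DiskLogger is deliberately left as a comment because its exact name should be taken from the Isis2 manual rather than guessed at here):

    using System;
    using Isis;

    class DurableCommandLog
    {
        const int COMMAND = 1;                     // illustrative message code

        static void Main()
        {
            IsisSystem.Start();
            Group g = new Group("command-log");    // hypothetical group name

            g.Handlers[COMMAND] += (Action<string>)delegate(string cmd)
            {
                Console.WriteLine("execute: " + cmd);
            };

            // Assumption: a per-group configuration call made before Join() activates
            // the DiskLogger; consult the Isis2 documentation for its actual name.
            g.Join();

            // SafeSend is Isis2's Paxos: delivery happens only after a quorum has
            // accepted the message (and logged it, if the DiskLogger is active),
            // so the command survives crashes but costs more than Send/OrderedSend.
            g.SafeSend(COMMAND, "switch the Canadian 50KV bus online");

            IsisSystem.WaitForever();
        }
    }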

Consider the first tier of the cloud CS5412 Spring 2015 (Cloud Computing: Birman) 7  Recall that applications in the first tier are limited to what Brewer calls “Soft State”  They are basically prepositioned virtual machines that the cloud can launch or shut down very elastically  But when they shut down, they lose their “state”, including any temporary files  They always restart in the initial state that was wrapped up in the VM when it was built: no durable disk files

Examples of soft state? CS5412 Spring 2015 (Cloud Computing: Birman) 8  Anything that was cached but “really” lives in a database or file server elsewhere in the cloud  If you wake up with a cold cache, you just need to reload it with fresh data  Monitoring parameters, control data that you need to get “fresh” in any case  Includes data like “The current state of the air traffic control system” – for many applications, your old state is just not used when you resume after being offline  Getting fresh, current information guarantees that you’ll be in sync with the other cloud components  Information that gets reloaded in any case, e.g. sensor values

Would it make sense to use Paxos? CS5412 Spring 2015 (Cloud Computing: Birman) 9  We do maintain sharded data in the first tier and some requests certainly trigger updates  So that argues in favor of a consistency mechanism  In fact consistency can be important even in the first tier, for some cloud computing uses

Control of the smart power grid 10  Suppose that a cloud control system speaks with “two voices”  In physical infrastructure settings, consequences can be very costly  (Figure: one replica orders “Switch on the 50KV Canadian bus” while another reports “Canadian 50KV bus going offline”. Bang!) CS5412 Spring 2015 (Cloud Computing: Birman)

So… would we use Paxos here? CS5412 Spring 2015 (Cloud Computing: Birman) 11  In discussions of the CAP conjecture and in their papers on the BASE methodology, the authors generally assume that the “C” in CAP means ACID guarantees or Paxos  They then argue that these bring too much delay to be used in settings where fast response is critical  Hence they argue against Paxos

By now we’ve seen a second option CS5412 Spring 2015 (Cloud Computing: Birman) 12  Virtual synchrony Send is “like” Paxos yet different  Paxos has a very strong form of durability  Send has consistency but weak durability unless you use the “Flush” primitive. Send+Flush is amnesia-free  Further complicating the issue, in Isis 2 Paxos is called SafeSend, and has several options  Can set the number of acceptors  Can also configure to run in-memory or with disk logging

How would we pick? CS5412 Spring 2015 (Cloud Computing: Birman) 13  The application code looks nearly identical!  g.Send(GRIDCONTROL, action to take)  g.SafeSend(GRIDCONTROL, action to take)  Yet the behavior is very different!  SafeSend is slower  … and has stronger durability properties. Or does it?

SafeSend in the first tier CS5412 Spring 2015 (Cloud Computing: Birman) 14  Observation: like it or not, we just don’t have a durable place for disk files in the first tier  The only forms of durability are  In-memory replication within a shard  Inner-tier storage subsystems like databases or files  Moreover, the first tier is expected to be rapidly responsive and to talk to inner tiers asynchronously

So our choice is simplified CS5412 Spring 2015 (Cloud Computing: Birman) 15  No matter what anyone might tell you, in fact the only real choices are between two options  Send + Flush: Before replying to the external customer, we know that the data is replicated in the shard  In-memory SafeSend: On an update by update basis, before each update is taken, we know that the update will be done at every replica in the shard

Consistency model: Virtual synchrony meets Paxos (and they live happily ever after…) 16  Virtual synchrony is a “consistency” model:  Synchronous runs: indistinguishable from a non-replicated object that saw the same updates (like Paxos)  Virtually synchronous runs are indistinguishable from synchronous runs  (Figure: a non-replicated reference execution, a synchronous execution, and a virtually synchronous execution, all applying the updates A=3, B=7, B=B-A, A=A+1.) CS5412 Spring 2015 (Cloud Computing: Birman)

SafeSend vs OrderedSend vs Send CS5412 Spring 2015 (Cloud Computing: Birman) 17  SafeSend is durable and totally ordered and never has any form of odd behavior. Logs messages, replays them after a group shuts down and then later restarts. == Paxos.  OrderedSend is much faster but doesn’t log the messages (not durable) and also is “optimistic” in a sense we will discuss. Sometimes must combine with Flush.  Send is FIFO and optimistic, and also may need to be combined with Flush.

Looking closely at that “oddity” CS5412 Spring 2015 (Cloud Computing: Birman) 18 Virtually synchronous execution “amnesia” example (Send but without calling Flush)

What made it odd? CS5412 Spring 2015 (Cloud Computing: Birman) 19  In this example a network partition occurred and, before anyone noticed, some messages were sent and delivered  “Flush” would have blocked the caller, and SafeSend would not have delivered those messages  Then the failure erases the events in question: no evidence remains at all  So was this bad? OK? A kind of transient internal inconsistency that repaired itself?

Looking closely at that “oddity” CS5412 Spring 2015 (Cloud Computing: Birman) 20-22  (Slides 20-22 step through the same execution diagram frame by frame; the figures are not reproduced in this transcript.)

Paxos avoided the issue… at a price CS5412 Spring 2015 (Cloud Computing: Birman) 23  SafeSend, Paxos and other multi-phase protocols don’t deliver in the first round/phase  This gives them stronger safety on a message-by-message basis, but also makes them slower and less scalable  Is that extra safety a price worth paying, or should we favor the better speed?

Revisiting our medical scenario CS5412 Spring 2015 (Cloud Computing: Birman) 24  (Figure: execution timeline for an individual first-tier replica in a soft-state first-tier service with members A, B, C, D. The request “Update the monitoring and alarms criteria for Mrs. Marsh as follows…” is multicast with Send, then flush, then the confirmed reply is sent; the local response delay is shown, and the response delay seen by the end-user would also include Internet latencies.)  An online monitoring system might focus on real-time response and be less concerned with data durability

Isis 2: Send vs. in-memory SafeSend 25 Send scales best, but SafeSend with in-memory (rather than disk) logging and small numbers of acceptors isn’t terrible. CS5412 Spring 2015 (Cloud Computing: Birman)

Jitter: how “steady” are latencies? CS5412 Spring 2015 (Cloud Computing: Birman) 26 The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays

Flush delay as a function of shard size CS5412 Spring 2015 (Cloud Computing: Birman) 27 Flush is fairly fast if we only wait for acks from 3-5 members, but is slow if we wait for acks from all members. After we saw this graph, we changed Isis 2 to let users set the threshold.

First-tier “mindset” for tolerating f faults CS5412 Spring 2015 (Cloud Computing: Birman) 28  Suppose we do this:  Receive request  Compute locally using consistent data and perform updates on sharded replicated data, consistently  Asynchronously forward updates to services deeper in the cloud but don’t wait for them to be performed  Use the “flush” to make sure we have f+1 replicas  Call this an “amnesia-free” solution. Will it be fast enough? Durable enough?
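A hedged sketch of that recipe, continuing in the same C# style (the request/reply plumbing, group name, and the inner-tier forwarding stub are placeholders invented for the example; the Isis2 calls are used as in the earlier sketches):

    using System;
    using System.Threading.Tasks;
    using Isis;

    class FirstTierShard
    {
        const int UPDATE = 0;                      // illustrative message code
        static Group g;

        // Invoked when an external client request arrives (transport layer omitted).
        static void HandleClientUpdate(string key, string value)
        {
            // 1. Update the sharded replicated data consistently.
            g.OrderedSend(UPDATE, key, value);

            // 2. Asynchronously forward the update to the inner-tier store; don't wait.
            Task.Run(() => ForwardToInnerTier(key, value));

            // 3. Flush: wait until the multicast is stable at f+1 replicas of the shard,
            //    so a failure of this replica cannot erase the update ("amnesia-free").
            g.Flush();

            // 4. Only now reply to the external client (reply path omitted).
        }

        static void ForwardToInnerTier(string key, string value)
        {
            // Placeholder for a write to a database or file system deeper in the cloud.
        }

        static void Main()
        {
            IsisSystem.Start();
            g = new Group("first-tier-shard");     // hypothetical group name
            g.Handlers[UPDATE] += (Action<string, string>)delegate(string k, string v)
            {
                // Apply the update to this replica's in-memory soft state.
            };
            g.Join();
            IsisSystem.WaitForever();
        }
    }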

Which replicas? CS5412 Spring 2015 (Cloud Computing: Birman) 29  One worry is this  If the first tier is totally under the control of a cloud management infrastructure, elasticity could cause our shard to be entirely shut down “abruptly”  Fortunately, most cloud platforms do have some way to notify the management system of shard membership  This allows the management system to shut down members of multiple shards without ever depopulating any single shard  Now the odds of a sudden amnesia event become low

Advantage: Send+Flush? CS5412 Spring 2015 (Cloud Computing: Birman) 30  It seems that way, but there is a counter-argument  The problem centers on the Flush delay  We pay it both on writes and on some reads  If a replica has been updated by an unstable multicast, it can’t safely be read until a Flush occurs  Thus we need to call Flush prior to replying to the client even in a read-only procedure (the delay will occur only if there are pending unstable multicasts)
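Continuing the FirstTierShard sketch shown earlier, the read path would look something like this (again illustrative; the lookup helper is a made-up placeholder):

    // Read-only request handler in the Send+Flush design.
    static string HandleClientRead(string key)
    {
        // If an unstable multicast has updated this replica, its effect could still be
        // erased by a failure, so wait for stability before exposing the value. Flush
        // returns immediately when nothing is pending, the common case for pure readers.
        g.Flush();
        return LookupLocalReplica(key);
    }

    static string LookupLocalReplica(string key)
    {
        return "";                                 // placeholder for the local in-memory lookup
    }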

We don’t need this with SafeSend CS5412 Spring 2015 (Cloud Computing: Birman) 31  In effect, it does the work of Flush prior to the delivery (“learn”) event  So we have slower delivery, but now any replica is always safe to read and we can reply to the client instantly  The updater sees the delay on its critical path, but the reader has no delays, ever

Advantage: SafeSend? CS5412 Spring 2015 (Cloud Computing: Birman) 32  The argument would be that with both protocols, there is a delay on the critical path where the update was initiated  But only Send+Flush ever delays a pure reader  So SafeSend is faster!  But this argument is flawed…

Flaws in that argument CS5412 Spring 2015 (Cloud Computing: Birman) 33  The delays aren’t of the same length (in fact the pure reader calls Flush but would rarely be delayed)  Moreover, if a request does multiple updates, we delay on each of them for SafeSend, but delay just once if we do Send…Send…Send…Flush  How to resolve?

Only real option is to experiment CS5412 Spring 2015 (Cloud Computing: Birman) 34  In the cloud we often see questions that arise at  Large scale,  High event rates,  … and where millisecond timings matter  Best to use tools to help visualize performance  Let’s see how one was used in developing Isis 2

Something was… strangely slow CS5412 Spring 2015 (Cloud Computing: Birman) 35  We weren’t sure why or where  Only saw it at high data rates in big shards  So we ended up creating a visualization tool just to see how long the system needed from when a message was sent until it was delivered  Here’s what we saw

Debugging: Stabilization bug 36  (Figure: at first Isis 2 runs very fast, as we later learned too fast to sustain; a backlog forms, and eventually it pauses, with a delay similar to a Flush delay.) CS5412 Spring 2015 (Cloud Computing: Birman)

Debugging: Stabilization bug fixed 37 The revised protocol is actually a tiny bit slower, but now we can sustain the rate CS5412 Spring 2015 (Cloud Computing: Birman)

Debugging: 358-node run slowdown 38 Original problem but at an even larger scale CS5412 Spring 2015 (Cloud Computing: Birman)

358-node run slowdown: Zoom in 39 Hard to make sense of the situation: Too much data! CS5412 Spring 2015 (Cloud Computing: Birman)

358-node run slowdown: Filter 40 Filtering is a necessary part of this kind of experimental performance debugging! CS5412 Spring 2015 (Cloud Computing: Birman)

What did we just see? CS5412 Spring 2015 (Cloud Computing: Birman) 41  Flow control is pretty important!  With a good multicast flow control algorithm, we can garbage collect spare copies of our Send or OrderedSend messages before they pile up and stay in a kind of balance  Why did we need spares? … To resend if the sender fails.  When can they be garbage collected? … When they become stable  How can the sender tell? … Because it gets acknowledgements from recipients
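To make the bookkeeping concrete, here is a toy C# sketch of the idea (not Isis2's actual implementation): the sender keeps a spare copy of each multicast until every member has acknowledged it, at which point the message is stable and the copy can be garbage collected.

    using System.Collections.Generic;

    // Toy model of multicast stability tracking (not the real Isis2 code).
    class StabilityTracker
    {
        // messageId -> members that have not yet acknowledged that message
        readonly Dictionary<int, HashSet<string>> pending = new Dictionary<int, HashSet<string>>();
        readonly List<string> members;

        public StabilityTracker(List<string> members) { this.members = members; }

        // Called when a multicast is sent: retain a spare copy until it is stable,
        // so it can be resent on the sender's behalf if the sender fails.
        public void OnSend(int msgId)
        {
            pending[msgId] = new HashSet<string>(members);
        }

        // Called when a recipient acknowledges delivery. Returns true once the
        // message is stable, meaning the spare copy can be garbage collected.
        public bool OnAck(int msgId, string member)
        {
            HashSet<string> waiting;
            if (!pending.TryGetValue(msgId, out waiting)) return false;
            waiting.Remove(member);
            if (waiting.Count == 0)
            {
                pending.Remove(msgId);             // stable: discard the spare copy
                return true;
            }
            return false;
        }

        // The backlog of unstable messages; a flow-control rule should slow senders
        // down when this grows, which is what the stabilization fix accomplished.
        public int Backlog { get { return pending.Count; } }
    }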

What did we just see? CS5412 Spring 2015 (Cloud Computing: Birman) 42  … in effect, we saw that one can get a reliable virtually synchronous ordered multicast to deliver messages at a steady rate

Would this be true for Paxos too? CS5412 Spring 2015 (Cloud Computing: Birman) 43  Yes, for some versions of Paxos  The Isis 2 version of Paxos, SafeSend, works a bit like OrderedSend and is stable for a similar reason  There are also versions of Paxos, such as Ring Paxos, that have a structure designed to make them stable and to give them a flow-control property  But not every version of Paxos is stable in this sense

Interesting insight… CS5412 Spring 2015 (Cloud Computing: Birman) 44  In fact, most versions of Paxos will tend to be bursty…  The fastest Q_W group members respond to a request before the slowest N - Q_W, allowing them to advance while the laggards develop a backlog  For example, with N = 5 and a write quorum of Q_W = 3, the leader can commit using the three fastest acceptors while the two slowest fall behind  This lets Paxos surge ahead, but suppose that conditions change (remember, the cloud is a world of strange scheduling delays and load shifts). One of those laggards will be needed to reestablish a quorum of size Q_W  … but it may take a while for it to deal with the backlog and rejoin the quorum!  Hence Paxos (as normally implemented) will exhibit long delays, triggered when cloud-computing conditions change

Conclusions? CS5412 Spring 2015 (Cloud Computing: Birman) 45  A question like “how much durability do I need in the first tier of the cloud” is easy to ask… harder to answer!  Study of the choices reveals two basic options  Send + Flush  SafeSend, in-memory  They actually are similar but SafeSend has an internal “flush” before any delivery occurs, on each request  SafeSend seems more costly  Steadiness of the underlying flow of messages favors optimistic early delivery protocols such as Send and OrderedSend. Classical versions of Paxos may be very bursty