CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Applications of these ideas

Over the past three weeks we’ve heard about:
- Gossip protocols
- Distributed monitoring, search, event notification
- Agreement protocols, such as 2PC and 3PC

Underlying theme: some things need stronger forms of consistency; some can manage with weaker properties.

Today, let’s look at an application that could run over several of these options, but where the consistency issue is especially “clear”.

Let’s start with 2PC and transactions

The problem: some applications perform operations on multiple databases. We would like a guarantee that either all the databases get updated, or none does.

The relevant paradigm? 2PC.

Problem: pictorial version

Goal? Either p succeeds, and both lists get updated, or something fails and neither does.

[figure: p sends “create new employee” to the Employees database and “add to 3rd-floor coffee fund” to the Coffee fund]

Issues?
- P could crash part way through…
- … a database could throw an exception, e.g. “invalid SSN” or “duplicate record”
- … a database could crash, then restart, and may have “forgotten” uncommitted updates (presumed abort)

2PC is a good match!

Adopt the view that each database votes on its willingness to commit:
- Until the commit actually occurs, the update is considered temporary
- In fact, a database is permitted to discard a pending update (covers the crash/restart case)

2PC covers all of these cases.

Solution

P runs the transactions, but warns the databases that these are part of a transaction on multiple databases:
- They need to retain locks & logs
- When finished, run a 2PC protocol
- Until it votes “ok”, a database can abort
- 2PC decides the outcome and informs them
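
To make this concrete, here is a minimal sketch of the coordinator’s side of 2PC, not from the lecture itself: the participant names and the send/recv_vote callbacks are hypothetical stand-ins for the real network and database endpoints, and a real coordinator would also log each step to stable storage so it can recover after a crash.

```python
def two_phase_commit(participants, send, recv_vote):
    # Phase 1: ask every database to prepare. Each retains its locks
    # & logs and votes on its willingness to commit.
    for p in participants:
        send(p, "PREPARE")
    votes = [recv_vote(p) for p in participants]

    # Phase 2: commit only if everyone voted "ok"; otherwise abort.
    decision = "COMMIT" if all(v == "ok" for v in votes) else "ABORT"
    for p in participants:
        send(p, decision)
    return decision

# Toy run with in-memory stand-ins for the network:
outcome = two_phase_commit(
    ["employees-db", "coffee-fund"],
    send=lambda p, msg: None,     # pretend delivery
    recv_vote=lambda p: "ok",     # both databases vote yes
)
assert outcome == "COMMIT"
```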

Low availability?

One concern: we know that 2PC blocks.
- It can happen if two processes fail
- It would need to happen at a particular stage of execution and be the “right” two… but this scenario isn’t all that rare

Options?
- Could use 3PC to reduce (not eliminate!) this risk, but will pay a cost on every transaction
- Or just accept the risk
- Can eliminate the risk with special hardware, but may pay a fortune!

Drilling down

Why would 3PC reduce but not eliminate the problem?
- It adds extra baggage and complexity
- The result is that if we had a perfect failure detector, the bad scenario is gone
- … but we only have timeouts
- … so there is still a bad scenario! It just turns out to be less likely, if we estimate risks

So: the risk of getting stuck is “slashed”.

Drilling down

Why not just put up with this risk?
- Even the 3PC solution can still sometimes get stuck
- Maybe the “I’m stuck” scenario should be viewed as a basic property of this kind of database replication!
- This approach leads towards “wizards” that sense the problem and then help the DB administrator relaunch the database if it does get stuck

Drilling down

What about special hardware?
- Usually, we would focus on dual-ported disks that have a special kind of switching feature
- Only one node “owns” a disk at a time. If a node fails, some other node will “take over” its disk
- Now we can directly access the state of a failed node, hence can make progress in that mystery scenario that worried us
- But this can add costs to the hardware

Connection to consistency

We’re trying to ensure a form of “all or nothing” consistency using 2PC: the idea for our database is to either do the transaction on all servers, or on none. But this concept can be generalized.

Auditing

Suppose we want to “audit” a system:
- Involves producing a summary of the state
- Should look as if the system was idle

Some options (so far)…
- Gossip to aggregate the system state
- Use RPC to ask everyone to report their state
- With 2PC, first freeze the whole system (phase 1), then snapshot the state

Auditing

Assets             Liabilities
$1,241,761,251.23  $875,221,117.17

Uses for auditing
- In a bank, may be the only practical way to understand “institutional risk”. Need to capture state at some instant in time: if branches report status at closing time, a bank that operates world-wide gets inconsistent answers!
- In a security-conscious system, might audit to try and identify the source of a leak
- In a hospital, want the ability to find out which people examined which records
- In an airline, might want to know about maintenance needs, spare-parts inventory

Other kinds of auditing
- In a complex system that uses locking, we might want to audit to see if a deadlock has arisen
- In a system that maintains distributed objects, we could “audit” to see if objects are referenced by anyone, and garbage-collect those that aren’t

Challenges

In a complex system, such as a big distributed web-services system, we won’t “know” all the components:
- The guy starting the algorithm knows it uses servers X and Y
- But server X talks to subsystem A, and Y talks to B and C…
- Algorithms need to “chase links”

Implications?
- Our gossip algorithms might be ok for this scenario: they have a natural ability to chase links
- A simple RPC scheme (“tell me your state”) becomes a nested RPC, as in the figure and sketch below

Nested RPC

[figure: a nested RPC fanning out across servers X, Y, Z and subsystems A, B]
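
Here is an illustrative sketch, my own rather than the lecture’s, of “tell me your state” chasing links as a nested RPC. The Node interface with state() and neighbors() is hypothetical; in a real system each recursive call would be remote, and the visited set keeps mutually linked components from looping forever.

```python
def collect_state(node, visited=None):
    """Gather state from node and everything reachable from it."""
    if visited is None:
        visited = set()
    if node.name in visited:          # already reported: break cycles
        return {}
    visited.add(node.name)
    report = {node.name: node.state()}     # this component's contribution
    for peer in node.neighbors():          # recurse into linked subsystems
        report.update(collect_state(peer, visited))
    return report
```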

Temporal distortions

Things can be complicated because we can’t predict:
- Message delays (they vary constantly)
- Execution speeds (often a process shares a machine with many other tasks)
- Timing of external events

Lamport looked at this question too.

Temporal distortions

What does “now” mean?

[figure: timelines for processes p0–p3, with events a–f]

Temporal distortions

Timelines can “stretch”… caused by scheduling effects, message delays, message loss…

[figure: timelines for p0–p3 with a stretched segment]

Temporal distortions

Timelines can “shrink”, e.g. something lets a machine speed up.

[figure: timelines for p0–p3 with a compressed segment]

Temporal distortions

Cuts represent instants of time. But not every “cut” makes sense: black cuts could occur, but not gray ones.

[figure: timelines for p0–p3 crossed by black and gray cuts]

Consistent cuts and snapshots

The idea is to identify system states that “might” have occurred in real life. We need to avoid capturing states in which a message is received but nobody is shown as having sent it. This is the problem with the gray cuts.
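
The rule itself fits in a few lines of Python. The representation is my own assumption, not from the lecture: events have unique ids, a cut is the set of event ids on its “past” side, and each message is recorded as a (send event, receive event) pair.

```python
def is_consistent(cut, messages):
    # A cut is consistent iff every message received inside the cut
    # was also sent inside the cut.
    return all(send in cut for send, recv in messages if recv in cut)

# Say message (c, d) was sent at event c and received at event d:
assert is_consistent({"a", "b", "c", "d"}, [("c", "d")])   # a black cut
assert not is_consistent({"a", "b", "d"}, [("c", "d")])    # a gray cut
```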

Temporal distortions

Red messages cross gray cuts “backwards”. In a nutshell: the cut includes a message that “was never sent”.

[figure: timelines for p0–p3; a red message crosses a gray cut backwards]

Who cares?

In our auditing example, we might think some of the bank’s money is missing. Or suppose that we want to do distributed deadlock detection:
- The system lets processes “wait” for actions by other processes
- A process can only do one thing at a time
- A deadlock occurs if there is a circular wait

Deadlock detection “algorithm”
- p worries: perhaps we have a deadlock
- p is waiting for q, so sends “what’s your state?”
- q, on receipt, is waiting for r, so sends the same question… and r for s…
- And s is waiting on p.
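
At heart the probe is a walk over a wait-for graph. Here is a minimal, non-distributed sketch, under the assumption that we already hold a wait_for map from each process to the one it is waiting on; gathering those edges with messages is exactly where the trouble starts, as the next slides show.

```python
def probe_for_deadlock(wait_for, start):
    seen = {start}
    current = wait_for.get(start)
    while current is not None:
        if current in seen:        # we came back around: a circular wait
            return True
        seen.add(current)
        current = wait_for.get(current)
    return False                   # reached a process that isn't waiting

# The scenario on this slide: p -> q -> r -> s -> p.
assert probe_for_deadlock({"p": "q", "q": "r", "r": "s", "s": "p"}, "p")
```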

Suppose we detect this state

We see a cycle… but is it a deadlock?

[figure: wait-for graph with the cycle p → q → r → s → p]

Phantom deadlocks!

Suppose the system has a very high rate of locking. Then perhaps a lock-release message “passed” a query message: we see “q waiting for r” and “r waiting for s”, but in fact, by the time we checked r, q was no longer waiting! In effect: we checked for deadlock on a gray cut, an inconsistent cut.

One solution is to “freeze” the system

[figure: nodes X, Y, Z, A, B are told “STOP!”]

One solution is to “freeze” the system

[figure: the nodes halt: “Ok…” “Yes sir!” “I’ll be late!” “Was I speeding?” “Sigh…”]

One solution is to “freeze” the system

[figure: “Sorry to trouble you, folks. I just need a status snapshot, please”]

One solution is to “freeze” the system

[figure: the nodes report: “No problem” “Hey, doesn’t a guy have a right to privacy?” “Done…” “Here you go…” “Sigh…”]

One solution is to “freeze” the system

[figure: “Ok, you can go now”]

Why does it work?

When we check bank accounts, or check for deadlock, the system is idle. So if “P is waiting for Q” and “Q is waiting for R”, we really mean “simultaneously”. But to get this guarantee we did something very costly, because no new work is being done!

Consistent cuts and snapshots

The goal is to draw a line across the system state such that:
- Every message “received” by a process is shown as having been sent by some other process
- Some pending messages might still be in communication channels

And we want to do this while running!

Turn the idea into an algorithm

To start a new snapshot, p_i…
- Builds a message: “p_i is initiating snapshot k”. The tuple (p_i, k) uniquely identifies the snapshot
- Writes down its own state
- Starts recording incoming messages on all channels

Turn the idea into an algorithm

Now p_i tells its neighbors to start a snapshot. In general, on first learning about snapshot (p_i, k), p_x:
- Writes down its state: p_x’s contribution to the snapshot
- Starts “tape recorders” for all communication channels
- Forwards the message on all outgoing channels
- Stops the “tape recorder” for a channel when a snapshot message for (p_i, k) is received on it

The snapshot consists of all the local state contributions and all the tape recordings for the channels. (See the sketch below.)
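
Below is a hedged sketch of these rules at a single process, in Python. The channel naming, the send callback, and the placeholder local state are my own assumptions, the sketch keys everything to a single snapshot id, and FIFO channels are assumed throughout.

```python
class SnapshotNode:
    def __init__(self, name, in_channels, out_channels, send):
        self.name = name
        self.in_channels = in_channels    # names of incoming channels
        self.out_channels = out_channels  # names of outgoing channels
        self.send = send                  # send(channel, message) callback
        self.local_state = None           # recorded on first marker
        self.recording = {}               # channels still being "taped"
        self.channel_state = {}           # finished channel recordings

    def on_marker(self, snap_id, channel):
        if self.local_state is None:
            # First time we learn of snapshot snap_id: write down our
            # state, start tape recorders on the other incoming channels,
            # and forward the marker on every outgoing channel. The
            # channel the marker arrived on is empty in this snapshot.
            self.local_state = ("state of", self.name)  # placeholder
            self.channel_state[channel] = []
            self.recording = {c: [] for c in self.in_channels
                              if c != channel}
            for c in self.out_channels:
                self.send(c, ("MARKER", snap_id))
        else:
            # A later marker stops that channel's tape recorder; the
            # tape is the channel's contribution to the snapshot.
            self.channel_state[channel] = self.recording.pop(channel)

    def on_message(self, msg, channel):
        # Ordinary messages arriving on a taped channel were in transit
        # across the cut, so they belong in the snapshot.
        if channel in self.recording:
            self.recording[channel].append(msg)
```

The initiator from the previous slide behaves like a process that has just received a marker, except that it tapes all of its incoming channels.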

Chandy/Lamport
- Outgoing wave of requests… incoming wave of snapshots and channel state
- The snapshot ends up accumulating at the initiator, p_i
- The algorithm doesn’t tolerate process failures or message failures

Chandy/Lamport

[figure: a network of processes p through z]

Chandy/Lamport

[figure: p: “I want to start a snapshot”]

Chandy/Lamport

[figure: p records local state]

Chandy/Lamport

[figure: p starts monitoring incoming channels]

Chandy/Lamport

[figure: recording the “contents of channel p-y”]

Chandy/Lamport

[figure: p floods the message on outgoing channels…]

Chandy/Lamport

[figure: q is done]

Chandy/Lamport

[figure: animation: snapshot contributions arrive back at the initiator, first from q, then s and z, then u, v and x, and finally r, w and y]

Chandy/Lamport

[figure: a snapshot of the network, assembled from all of p–z. Done!]

Practical implication

The snapshot won’t occur at a point in real time:
- Could be noticeable to certain kinds of auditors
- In some situations only a truly instantaneous audit can be accepted, but this isn’t common

What belongs in the snapshot?
- Local states… namely the “status of X when you asked”
- Messages in transit… e.g. if we’re transferring $1M from X to Y (otherwise that money would be missing)

Recap and summary

We’ve begun to develop powerful, general tools:
- They aren’t always of a form that the platform can (or should) standardize
- But we can understand them as templates that can be specialized to our needs

Thinking this way lets us see that many practical questions are just instances of the templates we’ve touched on in the course.

What next?

We’ll resume the development of primitives for replicating data:
- First, the notion of group membership. Turns out to have a very strong connection to our snapshot algorithm!
- Then fault-tolerant multicast
- Then ordered multicast delivery
- Finally this leads to the virtual synchrony “model”
- Then we tackle more practical problems