CS5412: HOW MUCH ORDERING? Ken Birman 1 CS5412 Spring 2012 (Cloud Computing: Birman) Lecture XVI.

Slides:



Advertisements
Similar presentations
EE384y: Packet Switch Architectures
Advertisements

Computer Networks TCP/IP Protocol Suite.
1 Concurrency: Deadlock and Starvation Chapter 6.
Zhongxing Telecom Pakistan (Pvt.) Ltd
Copyright © 2003 Pearson Education, Inc. Slide 1.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
Dr. Multicast for Data Center Communication Scalability Ymir Vigfusson Hussam Abu-Libdeh Mahesh Balakrishnan Ken Birman Cornell University Yoav Tock IBM.
DCV: A Causality Detection Approach for Large- scale Dynamic Collaboration Environments Jiang-Ming Yang Microsoft Research Asia Ning Gu, Qi-Wei Zhang,
Chapter 5 Input/Output 5.1 Principles of I/O hardware
Chapter 6 File Systems 6.1 Files 6.2 Directories
1 Chapter 12 File Management Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design Principles,
Chapter 1 Introduction Copyright © Operating Systems, by Dhananjay Dhamdhere Copyright © Introduction Abstract Views of an Operating System.
1 Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create (
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Week 2 The Object-Oriented Approach to Requirements
Time, Clocks, and the Ordering of Events in a Distributed System
SE-292 High Performance Computing
Configuration management
OOAD – Dr. A. Alghamdi Mastering Object-Oriented Analysis and Design with UML Module 3: Requirements Overview Module 3 - Requirements Overview.
Local Area Networks - Internetworking
David Luebke 1 6/7/2014 ITCS 6114 Skip Lists Hashing.
Project 5: Virtual Memory
Hash Tables.
CS 241 Spring 2007 System Programming 1 Memory Replacement Policies Lecture 32 Klara Nahrstedt.
Chapter 10: Virtual Memory
Christophe Jelger – CS221 Network and Security - Universität Basel Christophe Jelger Post-doctoral researcher IP Multicasting.
By Waqas Over the many years the people have studied software-development approaches to figure out which approaches are quickest, cheapest, most.
Outline Minimum Spanning Tree Maximal Flow Algorithm LP formulation 1.
Ken Birman 1. Refresher: Dekers Algorithm Assumes two threads, numbered 0 and 1 CSEnter(int i) { int J = i^1; inside[i] = true; turn = J; while(inside[J]
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
Chapter 6 File Systems 6.1 Files 6.2 Directories
4 Oracle Data Integrator First Project – Simple Transformations: One source, one target 3-1.
Executional Architecture
10 -1 Chapter 10 Amortized Analysis A sequence of operations: OP 1, OP 2, … OP m OP i : several pops (from the stack) and one push (into the stack)
Systems Analysis and Design in a Changing World, Fifth Edition
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan
Network Operations & administration CS 4592 Lecture 15 Instructor: Ibrahim Tariq.
PSSA Preparation.
Foundations of Data Structures Practical Session #7 AVL Trees 2.
Dan Deng 10/30/11 Ordering and Consistent Cuts 1.
9. Two Functions of Two Random Variables
Isis 2 guarantees Did you understand how Isis 2 ordering and durability work?
CPSC 322, Lecture 5Slide 1 Uninformed Search Computer Science cpsc322, Lecture 5 (Textbook Chpt 3.5) Sept, 14, 2012.
TCP/IP Protocol Suite 1 Chapter 18 Upon completion you will be able to: Remote Login: Telnet Understand how TELNET works Understand the role of NVT in.
Distributed Computing 5. Snapshot Shmuel Zaks ©
Chapter 9: Using Classes and Objects. Understanding Class Concepts Types of classes – Classes that are only application programs with a Main() method.
Logical Clocks (2).
The University of Adelaide, School of Computer Science
D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.
Synchronization Chapter clock synchronization * 5.2 logical clocks * 5.3 global state * 5.4 election algorithm * 5.5 mutual exclusion * 5.6 distributed.
Distributed Systems 2006 Group Communication I * *With material adapted from Ken Birman.
Virtual Synchrony Jared Cantwell. Review Multicast Causal and total ordering Consistent Cuts Synchronized clocks Impossibility of consensus Distributed.
CS5412: HOW MUCH ORDERING? Ken Birman 1 CS5412 Spring 2014 (Cloud Computing: Birman) Lecture XVI.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.
Ordering and Consistent Cuts Presented By Biswanath Panda.
Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.
Distributed Systems 2006 Retrofitting Reliability* *With material adapted from Ken Birman.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering.
Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
Distributed Systems 2006 Virtual Synchrony* *With material adapted from Ken Birman.
Lecture 12 Synchronization. EECE 411: Design of Distributed Software Applications Summary so far … A distributed system is: a collection of independent.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering.
Gursharan Singh Tatla Transport Layer 16-May
CS514: Intermediate Course in Operating Systems Professor Ken Birman Krzys Ostrowski: TA.
Replication (1). Topics r Why Replication? r System Model r Consistency Models – How do we reason about the consistency of the “global state”? m Data-centric.
COORDINATION, SYNCHRONIZATION AND LOCKING WITH ISIS2 Ken Birman 1 Cornell University.
D u k e S y s t e m s Asynchronous Replicated State Machines (Causal Multicast and All That) Jeff Chase Duke University.
Lecture 12-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2012 Indranil Gupta (Indy) October 4, 2012 Lecture 12 Mutual Exclusion.
Fault Tolerance (2). Topics r Reliable Group Communication.
Mutual Exclusion Algorithms. Topics r Defining mutual exclusion r A centralized approach r A distributed approach r An approach assuming an organization.
Presentation transcript:

CS5412: HOW MUCH ORDERING? Ken Birman 1 CS5412 Spring 2012 (Cloud Computing: Birman) Lecture XVI

Ordering CS5412 Spring 2012 (Cloud Computing: Birman) 2 The key to consistency turns has turned out to be delivery ordering (durability is a separate thing) Given replicas that are initially in the same state… … if we apply the same updates (with no gaps or dups) in the same order, they stay in the same state. Weve seen how the virtual synchrony model uses this notion of order for Delivering membership view events Delivery of new update events

But what does same order mean? CS5412 Spring 2012 (Cloud Computing: Birman) 3 The easy answer is to assume that the same order means just what is says Every member gets every message in the identical sequence This was what we called a synchronous behavior Better term might be closely synchronous since we arent using synchronous clocks Synchronous execution

As an example… CS5412 Spring 2012 (Cloud Computing: Birman) 4 Suppose some group manages variables X and Y P sends updates to X and Y, and so does Q P: X = X-2 Q: X = 17.3 Q: Y = Y*2 + X T: Y = 99 The updates conflict: order matters The model keeps the replicas synchronized

But what if items have leaders CS5412 Spring 2012 (Cloud Computing: Birman) 5 Suppose all the updates to X are by P All the updates to Y are by Q Nobody ever looks at X and Y simultaneously Could this ever arise? Certainly! Many systems keep things like inventories Updates might be done as we add or remove items from the stockroom

Does this impact ordering? CS5412 Spring 2012 (Cloud Computing: Birman) 6 Now the rule is simpler As long as we perform updates in the order the leader issued them, for each given item, the replicas of the item remain consistent Here we see a FIFO ordering: with multiple leaders we have multiple FIFO streams, but each one is behaving like a 1-n version of TCP

Update the monitoring and alarms criteria for Mrs. Marsh as follows… Confirmed Response delay seen by end-user would also include Internet latencies Local response delay flush Send Execution timeline for an individual first-tier replica Soft-state first-tier service If A is the only process to handle updates, a FIFO Send is all we need to maintain consistency 7 Revisiting our medical scenario A B C D CS5412 Spring 2012 (Cloud Computing: Birman)

From FIFO to causal... CS5412 Spring 2012 (Cloud Computing: Birman) 8 A fancier FIFO ordering policy can also arise Consider P and Q that both update X but with locks: First P obtains the lock before starting to do updates Then it sends updates for item X for a while Then it releases the lock and Q acquires it Then Q sends updates on X, too What ordering rule is needed here?

Update the monitoring and alarms criteria for Mrs. Marsh as follows… Confirmed Response delay seen by end-user would also include Internet latencies Local response delay flush Send Execution timeline for an individual first-tier replica Soft-state first-tier service A B C D Notice that the send by C is after the send by A 9 Causal ordering variation CS5412 Spring 2012 (Cloud Computing: Birman)

Causal ordering CS5412 Spring 2012 (Cloud Computing: Birman) 10 This example illustrates a concept Leslie Lamport calls causal ordering As release of the lock on X to B caused B to issue updates on X. When B was done, A resumed. The update order is As, then Bs, then As. Lamports happened-before relation captures this If P sends m, and Q sends m, and m m, then we want m delivered before m Called a causal delivery rule

Mutual exclusion Dark blue when holding the lock Lock moving around is like a thread of control that moves from process to process Our goal is FIFO along the causal thread and the causal order is thus exactly what we need to enforce In effect, causal order is like total order except that the sender moves around over time A B C D E CS5412 Spring 2012 (Cloud Computing: Birman) 11

Same idea with several locks Suppose red defines the lock on X Blue is the lock on Y The relative ordering of X/Y updates isnt important because those events commute: they update different variables Causal order captures this too pqrst CS5412 Spring 2012 (Cloud Computing: Birman) 12

Can we implement causal delivery? CS5412 Spring 2012 (Cloud Computing: Birman) 13 Think about how one implements FIFO multicast We just put a counter value in each outgoing multicast Nodes keep track and deliver in sequence order Substitute a vector timestamp We put a list of counters on each outgoing multicast Nodes deliver multicasts only if they are next in the causal ordering No extra rounds required, just a bit of extra space (one counter for each possible sender)

Total ordering CS5412 Spring 2012 (Cloud Computing: Birman) 14 Multicasts in a single agreed order no matter who sends them, without locking required SafeSend (Paxos) has this property Isis 2 also provides a faster OrderedSend: total ordering, but without strong durability

Levels of ordering one can use CS5412 Spring 2012 (Cloud Computing: Birman) 15 No ordering or even no reliability (like IP multicast) FIFO ordering (requires an integer counter) Causal ordering (requires vector timestamps) Total ordering (requires a form of lock). Can be implemented as a causal and total order Paxos agreed ordering (tied to strong durability) Isis 2 offers all of these options

Consistent cuts and Total Order CS5412 Spring 2012 (Cloud Computing: Birman) 16 Recall our discussion of consistent cuts Like an instant in time for a distributed system Guess what: An event triggered by a totally ordered message delivery happens on a consistent cut! For example, it is safe to use a totally ordered query to check for a deadlock, or to count something The answer will be correct No ghost deadlocks or double counting or undercounting

Isis 2 multicast primitives RawSend: No guarantees Send: FIFO CausalSend: Causal order OrderedSend: Total order SafeSend: Paxos Flush: Durability (not needed for SafeSend) In-memory/disk durability (SafeSend only) Ability to specify the number of acceptors (SafeSend) 17 CS5412 Spring 2012 (Cloud Computing: Birman) Names for PrimitivesAdditional Options … all come in P2P and multicast forms, and all can be used as basis of Query requests

Will people need so many choices? CS5412 Spring 2012 (Cloud Computing: Birman) 18 Most developers start by using OrderedSend for situations where strong durability isnt a key requirement (total order) SafeSend if total order plus strong durability is needed Then they switch to weaker ordering primitives if Application has a structure that permits it Performance benefit outweighs the added complexity Using the right primitive lets you pay for exactly what you need

Virtual synchrony recap 19 Virtual synchrony is a consistency model: Synchronous runs: indistinguishable from non-replicated object that saw the same updates (like Paxos) Virtually synchronous runs are indistinguishable from synchronous runs Synchronous executionVirtually synchronous execution Non-replicated reference execution A=3B=7B = B-A A=A+1 CS5412 Spring 2012 (Cloud Computing: Birman)

Some additional Isis 2 features CS5412 Spring 2012 (Cloud Computing: Birman) 20 State transfer and logging User registers a method that can checkpoint group state, and methods to load from checkpoint Isis 2 will move such a checkpoint to a new member, or store it into a file, at appropriate times

Security CS5412 Spring 2012 (Cloud Computing: Birman) 21 Based on 256-bit AES keys Two cases: Key for the entire system, and per-group keys. System keys: used to sign messages (not encrypt!) Per-group keys: all data sent on the network is encrypted first But where do the keys themselves get stored?

Security CS5412 Spring 2012 (Cloud Computing: Birman) 22 One option is to keep the key material outside of Isis 2 in a standard certificate repository Application would start up, fetch certificate, find keys inside, and hand them to Isis 2 This is the recommended approach A second option allows Isis 2 to create keys itself But these will be stored in files under your user-id File protection guards these: only you can access them If someone were to log in as you, they could find the keys and decrypt group traffic

Flow control CS5412 Spring 2012 (Cloud Computing: Birman) 23 Two forms Built-in flow control is automatic and attempts to avoid overload situations in which senders swamp (some) receivers with too much traffic, causing them to fall behind and, eventually, to crash This is always in force except when using RawSend

Flow control CS5412 Spring 2012 (Cloud Computing: Birman) 24 The other form is user-controlled: You specify a leaky bucket policy, Isis 2 implements it Tokens flow into a bucket at a rate you can specify They also age out eventually (leak) Each multicast costs a token and waits if the bucket is empty Fully automated flow control appears to be very hard and may be impractical

Dr. Multicast CS5412 Spring 2012 (Cloud Computing: Birman) 25 Something else Isis 2 does is to manage the choice of how multicast gets sent Several cases Isis 2 can use IP multicast, if permitted. User controls the range of port numbers and the maximum number of groups Isis 2 can send packets over UDP, if UDP is allowed and a particular group doesnt have permission to use Dr. Multicast Isis 2 can tunnel over an overlay network of TCP links (a kind of tree with log(N) branching factor at each level)

Anatomy of a meltdown A blend of stories (eBay, Amazon, Yahoo): Pub-sub message bus very popular. System scaled up. Rolled out a faster ethernet. Product uses IPMC to accelerate sending All goes well until one day, under heavy load, loss rates spike, triggering collapse Oscillation observed CS5412 Spring 2012 (Cloud Computing: Birman) 26

IPMC aggregation and flow control! Recall: IPMC became promiscuous because too many multicast channels were used And this triggered meltdowns Why not aggregate (combine) IPMC channels? When two channels have similar receiver sets, combine them into one channel Filter (discard) unwanted extra messages CS5412 Spring 2012 (Cloud Computing: Birman) 27

Application sees what looks like a normal IPMC interface (socket library) We intercept requests and map them to IPMC groups of our choice (or even to UDP) Dr. Multicast 28 CS5412 Spring 2012 (Cloud Computing: Birman)

Channel Aggregation Algorithm by Vigfusson, Tock papers: [HotNets 09, LADIS 2008] Uses a k-means clustering algorithm Generalized problem is NP complete But heuristic works well in practice CS5412 Spring 2012 (Cloud Computing: Birman) 29

Optimization Questions o Assign IPMC and unicast addresses s.t. % receiver filtering (hard) Min. network traffic # IPMC addresses (hard) Prefers sender load over receiver load Intuitive control knobs as part of the policy (1) CS5412 Spring 2012 (Cloud Computing: Birman) 30

MCMD Heuristic Topics in `user- interest space FGIF B EER G ROUP F REE F OOD (1,1,1,1,1,0,1,0,1,0,1,1)(0,1,1,1,1,1,1,0,0,1,1,1) 31 CS5412 Spring 2012 (Cloud Computing: Birman)

MCMD Heuristic Topics in `user- interest space CS5412 Spring 2012 (Cloud Computing: Birman) 32

MCMD Heuristic Topics in `user- interest space Filtering cost: MAX Sending cost: CS5412 Spring 2012 (Cloud Computing: Birman) 33

MCMD Heuristic Topics in `user- interest space Filtering cost: MAX Sending cost: Unicast CS5412 Spring 2012 (Cloud Computing: Birman) 34

MCMD Heuristic Topics in `user- interest space Unicast CS5412 Spring 2012 (Cloud Computing: Birman) 35

Using the Solution Procs L-IPMC Heuristic multicast Procs L-IPMC Processes use logical IPMC addresses Dr. Multicast transparently maps these to true IPMC addresses or 1:1 UDP sends CS5412 Spring 2012 (Cloud Computing: Birman) 36

Effectiveness? We looked at various group scenarios Most of the traffic is carried by <20% of groups For IBM Websphere, Dr. Multicast achieves 18x reduction in physical IPMC addresses [Dr. Multicast: Rx for Data Center Communication Scalability. Ymir Vigfusson, Hussam Abu- Libdeh, Mahesh Balakrishnan, Ken Birman, and Yoav Tock. LADIS November 2008.] CS5412 Spring 2012 (Cloud Computing: Birman) 37

Dr. Multicast in Isis 2 CS5412 Spring 2012 (Cloud Computing: Birman) 38 System automatically tracks membership, data rates Periodically runs an optimization algorithm Merges similar groups Applies the Dr. Multicast greedy heuristic Isis 2 protocols think they are multicasting, but a logical to physical mapping will determine whether messages are sent via IPMC, 1-n UDP or the tree- tunnelling layer, all automatically

Large groups CS5412 Spring 2012 (Cloud Computing: Birman) 39 Isis 2 has two styles of acknowledgment protocol For small groups (up to ~1000 members), direct acks Large groups use a tree of token rings: slower, but very steady (intended for ,000 members) Also supports a scalable way to do queries with massive parallelism, based on aggregation Very likely that as we gain experience, well refine the way large groups are handled

Example: Parallel search Replies = g.query(LOOKUP, Name=*Smith); g.callback(myReplyHndlr, Replies, typeof(double)); public void myReplyHndlr(double[] fnd) { foreach(double d in fnd) avg += d; … } public void myLookup(string who) { divide work into viewSize() chunks this replica will search chunk # getMyRank(); ….. reply(myAnswer); } Group g = new Group(/amazon/something); g.register(LOOKUP, myLookup); Could overwhelm receiver CS5412 Spring 2012 (Cloud Computing: Birman) 40

Scalable Aggregation Used if group is really big Request, updates: still via multicast Response is aggregated within a tree Level 0 Level 1 Level 2 Agg(v a v b v c v d ) query a a ca c db vava vbvb vcvc vdvd Agg(v c v d ) Agg(v a v b ) reply Example: nodes {a,b,c,d} collaborate to perform a query CS5412 Spring 2012 (Cloud Computing: Birman) 41

Aggregated Parallel search Replies = g.query(LOOKUP, 27, Name=*Smith); g.callback(myReplyHndlr, Replies, typeof(double)); public void myReplyHndlr(double[] fnd) { The answer is in fnd[0]…. } public void myLookup(int rid, string who) { divide work into viewSize() chunks this replica will search chunk # getMyRank(); ….. SetAggregateValue(myAnswer); } Group g = new Group(/amazon/something); g.register(LOOKUP, myLookup); Rval = GetAggregateResult(27); Reply(Rval/DatabaseSize); CS5412 Spring 2012 (Cloud Computing: Birman) 42

Large groups CS5412 Spring 2012 (Cloud Computing: Birman) 43 They can only be used in a few ways All sending is actually done by the rank-0 member. If others send, a relaying mechanism forwards the message via the rank-0 member This use of Send does guarantee causal order: in fact it provides a causal, total ordering No support for SafeSend Thus most of the fancy features of Isis 2 are only for use in small groups

Recall our community slide? Weve seen how many (not all) of this was built! The system is very powerful with a wide variety of possible use styles and cases Isis 2 user object Isis 2 library Group instances and multicast protocols Flow Control Membership Oracle Large Group LayerTCP tunnels (overlay)Dr. MulticastPlatform Security Reliable SendingFragmentationGroup-Key Mgt Sense Runtime Environment Self-stabilizing Bootstrap Protocol Socket Mgt/Send/Rcv Send CausalSend OrderedSend SafeSend Query.... Message LibraryWrapped locks Bounded Buffers Oracle Membership Group membership Report suspected failures Views Other group members 44

Isis 2 offers (too) many choices! CS5412 Spring 2012 (Cloud Computing: Birman) 45 PrimitiveFIFO/Total?Causal?Weak/Strong DurabilitySmall/Large RawSend, RawP2PSend, RawQuery FIFONoNot even reliableEither Send, etc (same set of variants) FIFO if underlying group is small. Total order if large. NoReliable, weak durability (calling Flush assures strong durability) Either CausalSendFIFO+CausalYesReliable, weakOnly small OrderedSendTotalNoReliable, weakOnly small SafeSendTotalNoReliable, strongOnly small Aggregated QueryTotalNoReliable, weakOnly large Also: Secure/insecure, logged/not logged For SafeSend: # of acceptors, Disk vs. in-memory durability

Choice or simplicity CS5412 Spring 2012 (Cloud Computing: Birman) 46 Many developers just use Paxos Has the strongest properties, hence a good one-size- fits-all option. SafeSend with disk durability in Isis 2 But Paxos can be slow and this is one reason CAP is applied in the first tier of the cloud Isis 2 has a wide range of options Intended to permit experiments, innovative ideas Pay for what you need and use… SafeSend if you like … flexibility permits higher performance

Recommendation? CS5412 Spring 2012 (Cloud Computing: Birman) 47 We urge people to use Isis 2 but to initially start with very simple applications and styles of use Fancy features are for fancy use cases that really need them… many applications wont! Plan is to eventually offer a kind of recipe for building various standard applications in good ways… user would copy and evolve them.