CS514: Intermediate Course in Operating Systems Professor Ken Birman Ben Atkin: TA Lecture 21 Nov. 7

Bimodal Multicast Work is joint: –Ken Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, Yaron Minsky Paper appeared in ACM Transactions on Computer Systems –May, 1999

Reliable Multicast Family of protocols for 1-n communication Increasingly important in modern networks “Reliability” tends to be at one of two extremes on the spectrum: –Very strong guarantees, for example to replicate a controller position in an ATC system –Best effort guarantees, for applications like Internet conferencing or radio

Reliable Multicast It is common to try to reduce problems such as this to theory The issue is to specify the goal of the multicast protocol and then demonstrate that a particular implementation satisfies its spec. Each form of reliability has its own specification and its own pros and cons

Classical “multicast” problems Related to “coordinated attack,” “consensus” The two “schools” of reliable multicast –Virtual synchrony model (Isis, Horus, Ensemble) All or nothing message delivery with ordering Membership managed on behalf of group State transfer to joining member –Scalable reliable multicast Local repair of problems but no end-to-end guarantees

Virtual Synchrony Model Example view sequence: G 0 ={p,q}; r and s request to join and are added with state transfer, giving G 1 ={p,q,r,s}; p fails (crash), giving G 2 ={q,r,s}; t requests to join and is added with state transfer, giving G 3 ={q,r,s,t}... to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications
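The view sequence above can be captured in a tiny sketch (Python; a toy model, not an actual group-membership implementation — state transfer and delivery ordering are omitted): every member observes the same totally ordered sequence of views, and each view change adds or removes members atomically.

```python
class Group:
    """Toy model of view-synchronous membership: all members see the
    same totally ordered sequence of views (state transfer omitted)."""
    def __init__(self, members):
        self.views = [frozenset(members)]

    def change(self, add=(), remove=()):
        # Each view change is applied atomically at every member.
        new = (self.views[-1] | set(add)) - set(remove)
        self.views.append(frozenset(new))

g = Group({"p", "q"})          # G0 = {p, q}
g.change(add={"r", "s"})       # r, s join (with state transfer): G1
g.change(remove={"p"})         # p fails: G2
g.change(add={"t"})            # t joins: G3
```

A real system would also guarantee that messages sent in a view are delivered in that view ("all or nothing" with respect to view changes); the sketch only tracks membership.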

Virtual Synchrony Tools Various forms of replication: –Replicated data, replicate an object, state transfer for starting new replicas... –1-many event streams (“network news”) Load-balanced and fault-tolerant request execution Management of groups of nodes or machines in a network setting

ATC system (simplified) Figure: controllers, air traffic database (flight plans, etc), X.500 directory, radar, onboard systems

Roles for reliable multicast? Replicate the state of controller consoles so that a group of consoles can support a team of controllers fault-tolerantly

Roles for reliable multicast? Distribute routine updates from the radar system, every 3-5 seconds (two kinds: images and digital tracking info)

Roles for reliable multicast? Distribute flight plan updates and decisions rarely (very critical information)

Other settings that use reliable multicast Basis of several stock exchanges Could be used to implement enterprise database systems and wide-area network services Could be the basis of web caching tools

But vsync protocols don’t scale well For small systems, performance is great. For large systems we can demonstrate very good performance… but only if applications “cooperate” In large systems, under heavy load, with applications that may act a little flaky, performance is very hard to maintain

Stock Exchange Problem: Vsync multicast is too “fragile” Most members are healthy… but one is slow

Throughput as one member of a multicast group is "perturbed" by forcing it to sleep for varying amounts of time.

Why does this happen? Superficially, because data for the slow process piles up in the sender’s buffer, causing flow control to kick in (prematurely) But why does the problem grow worse as a function of group size, with just one “red” process? Our theory is that small perturbations happen all the time (but more study is needed)

Dilemma confronting developers Application is extremely critical: stock market, air traffic control, medical system Hence need a strong model, guarantees –SRM, RMTP, etc: model is too weak –Anyhow, we’ve seen that they don’t scale either! But these applications often have a soft-realtime subsystem –Steady data generation –May need to deliver over a large scale

Today introduce a new design pt. Bimodal multicast is reliable in a sense that can be formalized, at least for some networks –Generalization for larger class of networks should be possible but maybe not easy Protocol is also very stable under steady load even if 25% of processes are perturbed Scalable in much the same way as SRM

Bimodal Attack A general commands his soldiers. If almost all attack, victory is certain. If very few attack, they will perish, but the Empire survives. If many attack, the Empire is lost. Attack! Consensus is impossible!

Worse and worse... General sends orders via couriers, who might perish or get lost in the forest (no spies or traitors) Some of the armies are sick with flu. They won’t fight and may not even have healthy couriers to participate in protocol (Message omission, crash and timing faults.)

Properties Desired Bimodal distribution of multicast delivery Ordering: FIFO on sender basis Stability: lets us garbage collect without violating bimodal distribution Detection of lost messages

More desired properties Formal model allows us to reason about how the primitive will behave in real systems Scalable protocol: costs (background messages, throughput) do not degrade as a function of system size Extremely stable throughput

Naming our new protocol Bimodal doesn’t lend itself to traditional contractions (bcast is taken; bicast and bimcast sound pretty awful) Focus on the probabilistic nature of the guarantees: call it pbcast Historical footnote: these are multicast protocols, but everyone uses “bcast” in names

Environment Will assume that most links have known throughput and loss properties Also assume that most processes are responsive to messages in bounded time But can tolerate some flaky links and some crashed or slow processes.

Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.

Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
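The four steps above — unreliable multicast, periodic digest gossip, solicitation of missing messages, and retransmission — can be sketched as a small simulation. This is a minimal illustration, not the actual Spinglass implementation; class and variable names are invented for the example.

```python
import random

class Node:
    """One pbcast participant: holds received messages by sequence number."""
    def __init__(self):
        self.messages = {}          # seqno -> payload

    def digest(self):
        # A digest identifies messages; it doesn't include them.
        return set(self.messages)

    def gossip_to(self, peer):
        # The peer compares our digest against its own history, solicits
        # any missing messages, and we retransmit them.
        for seqno in self.digest() - peer.digest():
            peer.messages[seqno] = self.messages[seqno]

random.seed(7)
nodes = [Node() for _ in range(10)]

# Step 1: the initial unreliable multicast reaches only some processes.
nodes[0].messages[1] = "update"
for n in nodes[1:]:
    if random.random() < 0.7:       # 30% loss on the first hop
        n.messages[1] = "update"

# Steps 2-4: periodic rounds of gossip repair the gaps.
rounds = 0
while not all(1 in n.messages for n in nodes) and rounds < 50:
    for n in nodes:
        n.gossip_to(random.choice(nodes))
    rounds += 1
```

Even with substantial initial loss, a handful of gossip rounds typically fills every gap — the epidemic spread that the analysis later formalizes.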

Delivery? Garbage Collection? Deliver a message when it is in FIFO order Garbage collect a message when you believe that no “healthy” process could still need a copy (we used to wait 10 rounds, but now are using gossip to detect this condition) Match parameters to intended environment
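The per-sender FIFO delivery rule can be sketched like this (a minimal model with illustrative names; garbage collection is omitted): out-of-order arrivals are buffered, and each sender's messages are delivered in sequence-number order.

```python
class FifoChannel:
    """Buffer out-of-order arrivals; deliver each sender's messages in
    sequence-number order (a sketch of the FIFO delivery rule)."""
    def __init__(self):
        self.next_seq = {}      # sender -> next seqno to deliver
        self.pending = {}       # sender -> {seqno: message}
        self.delivered = []

    def receive(self, sender, seqno, msg):
        self.pending.setdefault(sender, {})[seqno] = msg
        nxt = self.next_seq.get(sender, 0)
        # Deliver as long as the next expected message is present.
        while nxt in self.pending[sender]:
            self.delivered.append((sender, self.pending[sender].pop(nxt)))
            nxt += 1
        self.next_seq[sender] = nxt

ch = FifoChannel()
ch.receive("p", 0, "a")
ch.receive("p", 2, "c")      # held back: seqno 1 is still missing
ch.receive("p", 1, "b")      # fills the gap; 1 and 2 both deliver
```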

Need to bound costs Worries: –Someone could fall behind and never catch up, endlessly loading everyone else –What if some process has lots of stuff others want and they bombard him with requests? –What about scalability in buffering and in list of members of the system, or costs of updating that list?

Optimizations Request retransmissions most recent multicast first Idea is to “catch up quickly” leaving at most one gap in the retrieved sequence

Optimizations Participants bound the amount of data they will retransmit during any given round of gossip. If too much is solicited they ignore the excess requests

Optimizations Label each gossip message with the sender’s gossip round number Ignore solicitations that carry an expired round number, reasoning that they arrived very late and hence are probably no longer correct

Optimizations Don’t retransmit same message twice in a row to any given destination (the copy may still be in transit hence request may be redundant)

Optimizations Use IP multicast when retransmitting a message if several processes lack a copy –For example, if solicited twice –Also, if a retransmission is received from “far away” –Tradeoff: excess messages versus low latency Use regional TTL to restrict multicast scope
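Two of the optimizations above — the per-round retransmission budget and the expired-round filter — can be sketched together. The field names and sizes here are illustrative, not taken from the actual protocol.

```python
def answer_solicitations(current_round, requests, byte_budget):
    """Sketch: drop solicitations whose gossip round has expired, and
    bound the amount of data retransmitted in any one round."""
    to_send, used = [], 0
    for req in requests:
        if req["round"] < current_round:
            continue                     # arrived very late; likely stale
        if used + req["size"] > byte_budget:
            continue                     # ignore the excess requests
        to_send.append(req["msg_id"])
        used += req["size"]
    return to_send

reqs = [
    {"msg_id": 17, "round": 4, "size": 512},
    {"msg_id": 12, "round": 3, "size": 512},   # expired round: dropped
    {"msg_id": 18, "round": 4, "size": 4096},  # would blow the budget
    {"msg_id": 19, "round": 4, "size": 512},
]
```

With a 2048-byte budget in round 4, only messages 17 and 19 would be retransmitted.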

Scalability Protocol is scalable except for its use of the membership of the full process group Updates could be costly Size of list could be costly In large groups, would also prefer not to gossip over long high-latency links

Can extend pbcast to solve both Could use IP multicast to send the initial message. (Right now, we have a tree-structured alternative, but to use it, we need to know the membership) Tell each process only about some subset k of the processes, k << N This keeps costs constant.

Router overload problem Random gossip can overload a central router Yet information flowing through this router is of diminishing quality as rate of gossip rises Insight: constant rate of gossip is achievable and adequate

Figure: sender and receiver connected across a link with a high noise rate, and across a link with limited bandwidth (1500 Kbits); bandwidth of other links: 10 Mbits

Hierarchical Gossip Weight gossip so that probability of gossip to a remote cluster is smaller Can adjust weight to have constant load on router Now propagation delays rise… but just increase rate of gossip to compensate
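The weighting idea above amounts to biased partner selection. A minimal sketch (the weight value and peer names are illustrative): gossip mostly within the local cluster, and only occasionally across the router, so router load stays roughly constant regardless of cluster size.

```python
import random

def pick_gossip_target(local_peers, remote_peers, remote_prob):
    """Weighted partner selection: gossip mostly within the local
    cluster, occasionally across the router."""
    if remote_peers and random.random() < remote_prob:
        return random.choice(remote_peers)
    return random.choice(local_peers)

random.seed(3)
local = [f"l{i}" for i in range(20)]
remote = [f"r{i}" for i in range(20)]

# Over many rounds, about 10% of gossip crosses the router.
crossings = sum(
    pick_gossip_target(local, remote, remote_prob=0.1).startswith("r")
    for _ in range(10_000)
)
```

Lowering `remote_prob` as clusters grow keeps the absolute rate of inter-cluster gossip fixed; the cost, as the slide notes, is a longer propagation delay that can be recovered by gossiping faster.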

Hierarchical Gossip

Remainder of talk Show results of formal analysis –We developed a model (won’t do the math here -- nothing very fancy) –Used model to solve for expected reliability Then show more experimental data Real question: what would pbcast “do” in the Internet? Our experience: it works!

Idea behind analysis Can use the mathematics of epidemic theory to predict reliability of the protocol Assume an initial state Now look at result of running B rounds of gossip: converges exponentially quickly towards atomic delivery
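The flavor of the epidemic argument can be shown with a mean-field recursion (a standard textbook approximation, simpler than the paper's actual model): each round, every process holding the message gossips to one uniformly random peer, and the expected fraction of holders converges exponentially quickly toward 1.

```python
def expected_coverage(n, rounds, initial_fraction):
    """Mean-field sketch of epidemic spread: each round, every process
    holding the message gossips to one uniformly random peer.
    For intuition only; not the paper's exact analysis."""
    x = initial_fraction          # fraction of processes with the message
    history = [x]
    for _ in range(rounds):
        # P[a given process hears from none of the ~n*x gossiping holders]
        silent = (1 - 1 / n) ** (n * x)
        x = 1 - (1 - x) * silent
        history.append(x)
    return history

coverage = expected_coverage(n=100, rounds=10, initial_fraction=0.1)
```

Starting from 10% coverage in a 100-process group, ten rounds suffice to push expected coverage above 99% — the exponential convergence toward atomic delivery that the slide describes.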

Failure analysis Suppose someone tells me what they hope to “avoid” Model as a predicate on final system state Can compute the probability that pbcast would terminate in that state, again from the model

Two predicates Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast …call this the “General’s predicate” because he views it as a disaster if many but not most of his troops attack

Two predicates Predicate II: A faulty outcome is one where roughly half get the multicast and failures might “conceal” the true outcome … this would make sense if using pbcast to distribute quorum-style updates to replicated data. The costly, hence undesired, outcome is the one where we need to roll back because the outcome is “uncertain”

Two predicates Predicate I: More than 10% but less than 90% of the processes get the multicast Predicate II: Roughly half get the multicast but crash failures might “conceal” outcome Easy to add your own predicate. Our methodology supports any predicate over final system state
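The methodology — a predicate over the final system state, and a computed probability that the protocol terminates in a state satisfying it — can be illustrated with a crude Monte Carlo stand-in for the analytical model. Everything here is a simplification for illustration: the gossip model is a plain pull protocol with uniform loss, and the parameter values are invented.

```python
import random

def failure_probability(n, rounds, loss, predicate, trials=500, seed=1):
    """Monte Carlo sketch: estimate the probability that a simplified
    gossip run ends in a state satisfying a 'bad outcome' predicate.
    An illustration only, not the paper's analytical model."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Initial unreliable multicast: each process gets a copy w.p. 1-loss.
        has = [rng.random() > loss for _ in range(n)]
        for _ in range(rounds):
            snapshot = list(has)
            for i in range(n):
                j = rng.randrange(n)
                # Pull a copy from a random peer, subject to message loss.
                if snapshot[j] and rng.random() > loss:
                    has[i] = True
        if predicate(sum(has), n):
            bad += 1
    return bad / trials

# Predicate I, the General's predicate: many but not most deliver.
generals = lambda count, n: 0.1 * n < count < 0.9 * n

p_bad = failure_probability(n=50, rounds=8, loss=0.05, predicate=generals)
```

Any predicate over the final state can be plugged in the same way, which is the point of the methodology: after a few gossip rounds the "many but not most" region is reached with vanishing probability.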

Figure 5: Graphs of analytical results

Discussion We see that pbcast is indeed bimodal even in worst case, when initial multicast fails Can easily tune parameters to obtain desired guarantees of reliability Protocol is suitable for use in applications where bounded risk of undesired outcome is sufficient

Model makes assumptions... These are rather simplistic Yet the model seems to predict behavior in real networks, anyhow In effect, the protocol is not merely robust to process perturbation and message loss, but also to perturbation of the model itself Speculate that this is due to the incredible power of exponential convergence...

Experimental work The SP2 is a large network –Nodes are basically UNIX workstations –Interconnect is basically an ATM network –Software is standard Internet stack (TCP, UDP) We obtained access to as many as 128 nodes on the Cornell SP2 in the Theory Center Ran pbcast on this, and also ran a second implementation on a real network

Example of a question Create a group of 8 members Perturb one member in style of Figure 1 Now look at “stability” of throughput –Measure rate of received messages during periods of 100ms each –Plot histogram over life of experiment

Now revisit Figure 1 in detail Take 8 machines Perturb 1 Pump data in at varying rates, look at rate of received messages

Throughput with perturbed processes

Throughput variation as a function of scale

Impact of packet loss on reliability and retransmission rate Notice that when network becomes overloaded, healthy processes experience packet loss!

What about growth of overhead? Look at messages other than the original data distribution multicast Measure the worst-case scenario: costs at the main generator of multicasts Side remark: all of these graphs look identical with multiple senders or if overhead is measured elsewhere….

Growth of Overhead? Clearly, overhead does grow We know it will be bounded except for probabilistic phenomena At peak, load is still fairly low

Pbcast versus SRM, 0.1% packet loss rate on all links Tree networks Star networks

Pbcast versus SRM: link utilization

Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate

Pbcast Versus SRM: Interarrival Spacing

Pbcast versus SRM: Interarrival spacing (500 nodes, 300 members, 1.0% packet loss)

Real Data: Spinglass on a 10Mbit Ethernet (35 UltraSPARCs) Injected noise, retransmission limit disabled Injected noise, retransmission limit re-enabled

Networks structured as clusters Figure: sender and receiver connected across a link with a high noise rate, and across a link with limited bandwidth (1500 Kbits); bandwidth of other links: 10 Mbits

Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere

Requests/repairs and latencies with bounded router bandwidth

Discussion Saw that stability of protocol is exceptional even under heavy perturbation Overhead is low and stays low with system size, bounded even for heavy perturbation Throughput is extremely steady In contrast, virtual synchrony and SRM both are fragile under this sort of attack

Programming with pbcast? Most often would want to split application into multiple subsystems –Use pbcast for subsystems that generate regular flow of data and can tolerate infrequent loss if risk is bounded –Use stronger properties for subsystems with less load and that need high availability and consistency at all times

Programming with pbcast? In stock exchange, use pbcast for pricing but abcast for “control” operations In hospital use pbcast for telemetry data but use abcast when changing medication In air traffic system use pbcast for routine radar track updates but abcast when pilot registers a flight plan change

Our vision: One protocol side-by-side with the other Use virtual synchrony for replicated data and control actions, where strong guarantees are needed for safety Use pbcast for high data rates, steady flows of information, where longer term properties are critical but individual multicast is of less critical importance

Next steps Currently exploring memory use patterns and scheduling issues for the protocol Will study: –How much memory is really needed? Can we configure some processes with small buffers? –When to use IP multicast and what TTL to use –Exploiting network topology information –Application feedback options

Summary New data point in spectrum –Virtual synchrony protocols (atomic multicast as normally implemented) –Bimodal probabilistic multicast –Scalable reliable multicast Demonstrated that pbcast is suitable for analytic work Saw that it has exceptional stability

Epilogue Baron: “Although successful, your attack was reckless! It is impossible to achieve consensus on attack under such conditions!” General: “I merely took a calculated risk….”