Gossip Protocols Bimodal Multicast and T-MAN

Slides:



Advertisements
Similar presentations
ALMA MATER STUDIORUM – UNIVERSITA DI BOLOGNA ALMA MATER STUDIORUM – UNIVERSITA DI BOLOGNA Gossipping in Bologna Ozalp Babaoglu.
Advertisements

COS 461 Fall 1997 Group Communication u communicate to a group of processes rather than point-to-point u uses –replicated service –efficient dissemination.
1 CS 194: Distributed Systems Process resilience, Reliable Group Communication Scott Shenker and Ion Stoica Computer Science Division Department of Electrical.
COS 461 Fall 1997 Routing COS 461 Fall 1997 Typical Structure.
Optimizing Buffer Management for Reliable Multicast Zhen Xiao AT&T Labs – Research Joint work with Ken Birman and Robbert van Renesse.
Gossip Algorithms and Implementing a Cluster/Grid Information service MsSys Course Amar Lior and Barak Amnon.
Reliable Group Communication Quanzeng You & Haoliang Wang.
Improving TCP Performance over Mobile Ad Hoc Networks by Exploiting Cross- Layer Information Awareness Xin Yu Department Of Computer Science New York University,
Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.
588 Section 6 Neil Spring May 11, Schedule Notes – (1 slide) Multicast review –(3slides) RLM (the paper you didn’t read) –(3 slides) ALF & SRM –(8.
CMPE 150- Introduction to Computer Networks 1 CMPE 150 Fall 2005 Lecture 22 Introduction to Computer Networks.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Ben Atkin: TA Lecture 21 Nov. 7.
Multicast Communication Multicast is the delivery of a message to a group of receivers simultaneously in a single transmission from the source – The source.
Communication (II) Chapter 4
© 2002, Cisco Systems, Inc. All rights reserved..
ACN: RED paper1 Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson, IEEE Transactions on Networking, Vol.1, No. 4, (Aug.
Growth Codes: Maximizing Sensor Network Data Persistence abhinav Kamra, Vishal Misra, Jon Feldman, Dan Rubenstein Columbia University, Google Inc. (SIGSOMM’06)
2007/1/15http:// Lightweight Probabilistic Broadcast M2 Tatsuya Shirai M1 Dai Saito.
1 CS 4396 Computer Networks Lab TCP – Part II. 2 Flow Control Congestion Control Retransmission Timeout TCP:
TCP continued. Discussion – TCP Throughput TCP will most likely generate the saw tooth type of traffic. – A rough estimate is that the congestion window.
Fault Tolerance (2). Topics r Reliable Group Communication.
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
Day 13 Intro to MANs and WANs. MANs Cover a larger distance than LANs –Typically multiple buildings, office park Usually in the shape of a ring –Typically.
1 © 2004, Cisco Systems, Inc. All rights reserved. CCNA 2 v3.1 Module 8 TCP/IP Suite Error and Control Messages.
1 Group Communications: Host Group and IGMP Dr. Rocky K. C. Chang 19 March, 2002.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
William Stallings Data and Computer Communications
Network Layer.
Congestion Control in Data Networks and Internets
CSE 486/586 Distributed Systems Gossiping
CS 5565 Network Architecture and Protocols
The Transport Layer Implementation Services Functions Protocols
Chapter 9: Transport Layer
Internet Networking recitation #9
Chapter 3 outline 3.1 transport-layer services
Introduction to Wireless Sensor Networks
21-2 ICMP(Internet control message protocol)
CMPT 371 Data Communications and Networking
Network Layer.
Gossip-based Data Dissemination
Intra-Domain Routing Jacob Strauss September 14, 2006.
Routing: Distance Vector Algorithm
Congestion Control, Internet transport protocols: udp
湖南大学-信息科学与工程学院-计算机与科学系
Transport Layer Unit 5.
Chapter 5 TCP Transmission Control
TCP - Part II Relates to Lab 5. This is an extended module that covers TCP flow control, congestion control, and error control in TCP.
Sarah Diesburg Operating Systems COP 4610
Lecture 19 – TCP Performance
Advanced Operating System
COS 561: Advanced Computer Networks
Strayer University at Arlington, VA
IT351: Mobile & Wireless Computing
Internet Networking recitation #10
Congestion Control (from Chapter 05)
Andy Wang Operating Systems COP 4610 / CGS 5765
Congestion Control, Internet Transport Protocols: UDP
Distributed Systems CS
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Congestion Control (from Chapter 05)
Transport Layer: Congestion Control
Congestion Control (from Chapter 05)
Transport Protocols: TCP Segments, Flow control and Connection Setup
Congestion Control (from Chapter 05)
Network Layer.
TCP: Transmission Control Protocol Part II : Protocol Mechanisms
Distributed Systems CS
Presentation transcript:

Gossip Protocols Bimodal Multicast and T-MAN Ken Birman

Gossip: A valuable tool… So-called gossip protocols can be robust even when everything else is malfunctioning Idea is to build a distributed protocol a bit like gossip among humans “Did you hear that Sally and John are going out?” Gossip spreads like lightening…

Gossip: basic idea First rigorous analysis was by Alan Demers! Node A encounters “randomly selected” node B (might not be totally random) Gossip push (“rumor mongering”): A tells B something B doesn’t know Gossip pull (“anti-entropy”) A asks B for something it is trying to “find” Push-pull gossip Combines both mechanisms

Definition: A gossip protocol… Uses [mostly] random pairwise state merge Runs at a steady rate (much slower than network) [mostly] Uses bounded-size messages Does not depend on messages getting through reliably

Gossip benefits… and limitations Information flows around disruptions Scales very well Typically reacts to new events in log(N) Can be made self- repairing Rather slow Very redundant Guarantees are at best probabilistic Depends heavily on the randomness of the peer selection

For example We could use gossip to track the health of system components We can use gossip to report when something important happens In the remainder of today’s talk we’ll focus on event notifications. Next week we’ll look at some of these other uses

Typical push-pull protocol Nodes have some form of database of participating machines Could have a hacked bootstrap, then use gossip to keep this up to date! Set a timer and when it goes off, select a peer within the database Send it some form of “state digest” It responds with data you need and its own state digest You respond with data it needs

Where did the “state” come from? The data eligible for gossip is usually kept in some sort of table accessible to the gossip protocol This way a separate thread can run the gossip protocol It does upcalls to the application when incoming information is received

Gossip often runs over UDP Recall that UDP is an “unreliable” datagram protocol supported in internet Unlike for TCP, data can be lost Also packets have a maximum size, usually 4k or 8k bytes (you can control this) Larger packets are more likely to get lost! What if a packet would get too large? Gossip layer needs to pick the most valuable stuff to include, and leave out the rest!

Algorithms that use gossip Gossip can be used to… Notify applications about some event Track the status of applications in a system Organize the nodes in some way (like into a tree, or even sorted by some index) Find “things” (like files) Let’s look closely at an example

Bimodal multicast This is Cornell work from about 15 years ago Goal is to combine gossip with UDP multicast to make a very robust multicast protocol

Stock Exchange Problem: Sometimes, someone is slow… Most members are healthy…. … but one is slow i.e. something is contending with the red process, delaying its handling of incoming messages…

Remedy? Use gossip Start by using unreliable UDP multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

Remedy? Use gossip Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.

Remedy? Use gossip Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

Remedy? Use gossip Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.

Delivery? Garbage Collection? Deliver a message when it is in FIFO order Report an unrecoverable loss if a gap persists for so long that recovery is deemed “impractical” Garbage collect a message when you believe that no “healthy” process could still need a copy (we used to wait 10 rounds, but now are using gossip to detect this condition) Match parameters to intended environment

Need to bound costs Worries: Someone could fall behind and never catch up, endlessly loading everyone else What if some process has lots of stuff others want and they bombard him with requests? What about scalability in buffering and in list of members of the system, or costs of updating that list?

Optimizations Request retransmissions most recent multicast first Idea is to “catch up quickly” leaving at most one gap in the retrieved sequence

Optimizations Participants bound the amount of data they will retransmit during any given round of gossip. If too much is solicited they ignore the excess requests

Optimizations Label each gossip message with senders gossip round number Ignore solicitations that have expired round number, reasoning that they arrived very late hence are probably no longer correct

Optimizations Don’t retransmit same message twice in a row to any given destination (the copy may still be in transit hence request may be redundant)

Optimizations Use UDP multicast when retransmitting a message if several processes lack a copy For example, if solicited twice Also, if a retransmission is received from “far away” Tradeoff: excess messages versus low latency Use regional TTL to restrict multicast scope

Scalability Protocol is scalable except for its use of the membership of the full process group Updates could be costly Size of list could be costly In large groups, would also prefer not to gossip over long high-latency links

Router overload problem Random gossip can overload a central router Yet information flowing through this router is of diminishing quality as rate of gossip rises Insight: constant rate of gossip is achievable and adequate

Latency to delivery (ms) Load on WAN link (msgs/sec) Sender Receiver Limited bandwidth (1500Kbits) Bandwidth of other links:10Mbits Sender Receiver High noise rate Latency to delivery (ms) Load on WAN link (msgs/sec)

Hierarchical Gossip Weight gossip so that probability of gossip to a remote cluster is smaller Can adjust weight to have constant load on router Now propagation delays rise… but just increase rate of gossip to compensate

Idea behind analysis Can use the mathematics of epidemic theory to predict reliability of the protocol Assume an initial state Now look at result of running B rounds of gossip: converges exponentially quickly towards atomic delivery

… or data gets through w.h.p. Either sender fails… … or data gets through w.h.p.

Failure analysis Suppose someone tells me what they hope to “avoid” Model as a predicate on final system state Can compute the probability that pbcast would terminate in that state, again from the model

Two predicates Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast … Think of a probabilistic Byzantine General’s problem: a disaster if many but not most troops attack

Two predicates Predicate II: A faulty outcome is one where roughly half get the multicast and failures might “conceal” true outcome … this would make sense if using pbcast to distribute quorum-style updates to replicated data. The costly hence undesired outcome is the one where we need to rollback because outcome is “uncertain”

Two predicates Predicate I: More than 10% but less than 90% of the processes get the multicast Predicate II: Roughly half get the multicast but crash failures might “conceal” outcome Easy to add your own predicate. Our methodology supports any predicate over final system state

Figure 5: Graphs of analytical results

Example of a question Create a group of 8 members Perturb one member in style of Figure 1 Now look at “stability” of throughput Measure rate of received messages during periods of 100ms each Plot histogram over life of experiment

Source to dest latency distributions Notice that in practice, bimodal multicast is fast!

Now revisit Figure 1 in detail Take 8 machines Perturb 1 Pump data in at varying rates, look at rate of received messages

Revisit our original scenario with perturbations (32 processes)

Throughput variation as a function of scale

Impact of packet loss on reliability and retransmission rate Notice that when network becomes overloaded, healthy processes experience packet loss!

What about growth of overhead? Look at messages other than original data distribution multicast Measure worst case scenario: costs at main generator of multicasts Side remark: all of these graphs look identical with multiple senders or if overhead is measured elsewhere….

Growth of Overhead? Clearly, overhead does grow We know it will be bounded except for probabilistic phenomena At peak, load is still fairly low

Pbcast versus SRM, 0.1% packet loss rate on all links Tree networks Star networks

Pbcast versus SRM: link utilization

Pbcast versus SRM: 300 members on a 1000-node tree, 0 Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate

Pbcast Versus SRM: Interarrival Spacing

Pbcast versus SRM: Interarrival spacing (500 nodes, 300 members, 1 Pbcast versus SRM: Interarrival spacing (500 nodes, 300 members, 1.0% packet loss)

Real Data: Bimodal Multicast on a 10Mbit ethernet (35 Ultrasparc’s) Injected noise, retransmission limit disabled Injected noise, retransmission limit re-enabled

Networks structured as clusters Sender Receiver High noise rate Sender Receiver Limited bandwidth (1500Kbits) Bandwidth of other links:10Mbits

Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere

Requests/repairs and latencies with bounded router bandwidth

Gossipping in Bologna Ozalp Babaoglu

Background 2003: Márk Jelasity brings the gossipping gospel to Bologna from Amsterdam 2003-2006: We get good milage from gossipping in the context of Project BISON 2005-present: Continue to get milage in the context of Project DELIS

What have we done? We have used gossipping to obtain fast, robust, decentralized solutions for Aggregation Overlay topology management Heartbeat synchronization Cooperation in selfish environments

Collaborators Márk Jelasity Alberto Montresor Gianpaolo Jesi Toni Binci David Hales Stefano Arteconi

Proactive gossip framework // active thread do forever wait(T time units) q = SelectPeer() push S to q pull Sq from q S = Update(S,Sq) // passive thread do forever (p,Sp) = pull * from * push S to p S = Update(S,Sp)

Proactive gossip framework To instantiate the framework, need to define Local state S Method SelectPeer() Style of interaction push-pull push pull Method Update()

#1 Aggregation

Gossip framework instantiation Style of interaction: push-pull Local state S: Current estimate of global aggregate Method SelectPeer(): Single random neighbor Method Update(): Numerical function defined according to desired global aggregate (arithmetic/geometric mean, min, max, etc.)

Exponential convergence of averaging

Properties of gossip-based aggregation In gossip-based averaging, if the selected peer is a globally random sample, then the variance of the set of estimates decreases exponentially Convergence factor:

Robustness of network size estimation 1000 nodes crash at the beginning of each cycle

Robustness of network size estimation 20% of messages are lost

#2 Topology Management

Gossip framework instantiation Style of interaction: push-pull Local state S: Current neighbor set Method SelectPeer(): Single random neighbor Method Update(): Ranking function defined according to desired topology (ring, mesh, torus, DHT, etc.)

Mesh Example

Sorting example

Exponential convergence - time

Exponential convergence - network size

Heartbeat Synchronization #3 Heartbeat Synchronization

Synchrony in nature Nature displays astonishing cases of synchrony among independent actors Heart pacemaker cells Chirping crickets Menstrual cycle of women living together Flashing of fireflies Actors may belong to the same organism or they may be parts of different organisms

Coupled oscillators The “Coupled oscillator” model can be used to explain the phenomenon of “self-synchronization” Each actor is an independent “oscillator”, like a pendulum Oscillators coupled through their environment Mechanical vibrations Air pressure Visual clues Olfactory signals They influence each other, causing minor local adjustments that result in global synchrony

Fireflies Certain species of (male) fireflies (e.g., luciola pupilla) are known to synchronize their flashes despite: Small connectivity (each firefly has a small number of “neighbors”) Communication not instantaneous Independent local “clocks” with random initial periods

Gossip framework instantiation Style of interaction: push Local state S: Current phase of local oscillator Method SelectPeer(): (small) set of random neighbors Method Update(): Function to reset the local oscillator based on the phase of arriving flash

Experimental results

Exponential convergence

Cooperation in Selfish Environments #4 Cooperation in Selfish Environments

Outline P2P networks are usually open systems Possibility to free-ride High levels of free-riding can seriously degrade global performance A gossip-based algorithm can be used to sustain high levels of cooperation despite selfish nodes Based on simple “copy” and “rewire” operations

Gossip framework instantiation Style of interaction: pull Local state S: Current utility, strategy and neighborhood within an interaction network Method SelectPeer(): Single random sample Method Update(): Copy strategy and neighborhood if the peer is achieving better utility

SLAC Algorithm: “Copy and Rewire” F D G “Rewire” C Compare utilities A “Copy” strategy A H B K J

SLAC Algorithm: “Mutate” F D Link to random node G C A “Mutate” strategy A H B K Drop current links J

Prisoner’s Dilemma Prisoner’s Dilemma in SLAC Nodes play PD with neighbors chosen randomly in the interaction network Only pure strategies (always C or always D) Strategy mutation: flip current strategy Utility: average payoff achieved

Cycle 180: Small defective clusters

Cycle 220: Cooperation emerges

Cycle 230: Cooperating cluster starts to break apart

Cycle 300: Defective nodes isolated, small cooperative clusters formed

Phase transition of cooperation % of cooperating nodes

Broadcast Application How to communicate a piece of information from a single node to all other nodes While: Minimizing the number of messages sent (MC) Maximizing the percentage of nodes that receive the message (NR) Minimizing the elapsed time (TR)

Broadcast Application Given a network with N nodes and L links A spanning tree has MC = N A flood-fill algorithm has MC = L For fixed networks containing reliable nodes, it is possible to use an initial flood-fill to build a spanning tree from any node Practical if broadcasting initiated by a few nodes only In P2P applications this is not practical due to network dynamicity and the fact that all nodes may need to broadcast

The broadcast game Node initiates a broadcast by sending a message to each neighbor Two different node behaviors determine what happens when they receive a message for the first time: Pass: Forward the message to all neighbors Drop: Do nothing Utilities are updated as follows: Nodes that receive the message gain a benefit β Nodes that pass the message incur a cost γ Assume β > γ > 0, indicating nodes have an incentive to receive messages but also an incentive to not forward them

1000-node static random network

1000-node high churn network

Fixed random network Average over 500 broadcasts x 10 runs

High churn network Average over 500 broadcasts x 10 runs

Some food for thought What is it that makes a protocol “gossip based”? Cyclic execution structure (whether proactive or reactive) Bounded information exchange per peer, per cycle Bounded number of peers per cycle Random selection of peer(s)

Some food for thought Bounded information exchange per peer, per round implies Information condensation — aggregation Is aggregation the mother of all gossip protocols?

Some food for thought Is exponential convergence a universal characterization of all gossip protocols? No, depends on the properties of the peer selection step What are the minimum properties for peer selection that are necessary to guarantee exponential convergence?

Gossip versus evolutionary computing What is the relationship between gossip and evolutionary computing? Is one more powerful than the other? Are they equal?