Gossip Algorithms and Emergent Shape. Ken Birman. Gossip-Based Networking Workshop, Leiden, December 2006.
On Gossip and Shape. Why is gossip interesting? Scalability: protocols are highly symmetric, although latency often forces us to bias them. Powerful convergence properties, especially in support of epidemics. New forms of consistency: probabilistic, but this is often adequate.
Consistency. Captures the intuition that if A and B compare their states, no contradiction is evident. In systems with "logical" consistency, we say things like "A's history and B's are closed under causality, and A is a prefix of B." With probabilistic systems we seek a rapidly decreasing probability (as time elapses) that A knows "x" but B doesn't: probabilistic convergent consistency.
Exponential convergence. A subclass of convergence behaviors: not all gossip protocols offer exponential convergence, but epidemic protocols do have this property, and many gossip protocols implement epidemics.
Value of exponential convergence. An exponentially convergent protocol overwhelms mishaps and even attacks. It requires that new information reach the relevant nodes with at most log(N) delay. We can talk of "probability 1.0" outcomes; even model simplifications (such as an idealized network) are washed away! Predictions are rarely "off" by more than a constant. A rarity: a theory relevant to practice!
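To make the log(N) claim concrete, here is a minimal push-gossip simulation; the function name, the fanout of one target per round, and the node counts are illustrative assumptions, not a protocol from the talk. A rumor started at one node reaches all N nodes in a number of rounds that grows roughly like log N.

```python
# A minimal push-gossip sketch (assumed parameters): count rounds until a
# rumor started at node 0 has reached all n nodes.
import random

def push_gossip_rounds(n=10_000, fanout=1, seed=0):
    """Rounds until every node is informed, with uniform random targets."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        newly = set()
        for _node in informed:
            for _ in range(fanout):
                newly.add(rng.randrange(n))   # gossip target chosen uniformly at random
        informed |= newly
    return rounds

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, push_gossip_rounds(n))       # grows roughly like log2(n) + ln(n)
```

Increasing N tenfold adds only a handful of rounds, which is the exponential-convergence behavior the slide describes.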
Convergent consistency. To illustrate the point, contrast Cornell's Kelips system with MIT's Chord: Kelips is convergent, Chord isn't.
Kelips (Linga, Gupta, Birman). Take a collection of "nodes".
Kelips. Map nodes to affinity groups: peer membership is determined by a consistent hash, with roughly √N members per affinity group.
Kelips. (Diagram: each node keeps an affinity group view, a table of (id, heartbeat, rtt in ms) entries for the peers in its own group. For example, node 110 knows about other members of its group, such as 230 and 30.)
Kelips. (Diagram: each node also keeps contact pointers, a small table of (group, contact node) entries for the other affinity groups. For example, node 202 is a "contact" for node 110 in group 2.)
Kelips. A gossip protocol replicates data cheaply. (Diagram: members of a group also store resource tuples, (resource, info) pairs such as cnn.com → 110. "cnn.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about cnn.com to it.)
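A minimal sketch of the mapping just described, assuming SHA-1 as the consistent hash and k ≈ √N affinity groups; the helper names and node counts are hypothetical, not Kelips code. Nodes and resource names hash to groups, and a resource tuple such as cnn.com is stored in the group its name hashes to.

```python
# Sketch: hash node ids and resource keys into one of k ~ sqrt(N) affinity
# groups, so each group holds about sqrt(N) members.
import hashlib, math

def affinity_group(identifier: str, num_groups: int) -> int:
    """Map a node id or resource name to an affinity group via a stable hash."""
    digest = hashlib.sha1(identifier.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

nodes = [f"node-{i}" for i in range(100)]      # N = 100 example nodes
k = round(math.sqrt(len(nodes)))               # ~sqrt(N) affinity groups

groups = {g: [] for g in range(k)}
for node in nodes:
    groups[affinity_group(node, k)].append(node)

# A resource tuple such as "cnn.com" is replicated within the group it hashes to.
print("cnn.com ->", affinity_group("cnn.com", k))
```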
How it works. Kelips is entirely gossip based! Gossip about membership; gossip to replicate and repair data; gossip about "last heard from" times, used to discard failed nodes. The gossip "channel" uses fixed bandwidth: a fixed rate and packets of limited size.
How it works. Heuristic: periodically ping contacts to check liveness and RTT, and swap so-so ones for better ones. (Diagram: node 175, at RTT 235 ms, is a contact for node 102 in some affinity group; from the gossip data stream, node 102 notices that node 19, at RTT 6 ms, looks like a much better contact for that affinity group.)
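A hedged sketch of that heuristic under assumed names (Contact, refresh_contacts are not Kelips APIs): keep the lowest-RTT contacts for a remote group and drop the worst one when a better candidate appears.

```python
# Sketch: replace a high-RTT contact with a better candidate for a remote group.
from dataclasses import dataclass

@dataclass
class Contact:
    node_id: int
    rtt_ms: float

def refresh_contacts(contacts, candidate, max_contacts=2):
    """Keep the lowest-RTT contacts for a remote affinity group."""
    pool = contacts + [candidate]
    pool.sort(key=lambda c: c.rtt_ms)
    return pool[:max_contacts]

contacts = [Contact(175, 235.0), Contact(44, 80.0)]
better = Contact(19, 6.0)                     # node 19 looks like a much better contact
print(refresh_contacts(contacts, better))     # node 175 (235 ms) is dropped
```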
Work in progress… Prakash Linga is extending Kelips to support multi-dimensional indexing, range queries, and self-rebalancing. Kelips has a limited incoming "info rate"; behavior when the limit is continuously exceeded is not well understood, and we will also study this phenomenon.
Replication makes it robust. Kelips should work even during disruptive episodes: after all, tuples are replicated to roughly √N nodes. Querying k nodes concurrently overcomes isolated crashes and also reduces the risk that very recent data could be missed. We often overlook the importance of showing that systems work while recovering from a disruption.
Chord (MIT group). The McDonald's of DHTs: a data structure mapped to a network. A ring of nodes (hashed ids), superimposed binary lookup trees, and other cached "hints" for fast lookups. Chord is not convergently consistent.
Chord picture. (Diagrams: a ring of nodes with finger links and a cached link; a second diagram shows a transient network partition splitting the ring into a USA half and a Europe half.)
… so, who cares? Chord lookups can fail, and it suffers from high overheads when nodes churn. Loads surge just when things are already disrupted, quite often because of the loads themselves. And we can't predict how long Chord might remain disrupted once it gets that way. Worst-case scenario: Chord can become inconsistent and stay that way. (The fine print: the scenario you have been shown is of low probability. In all likelihood, Chord would repair itself after any partitioning failure that might really arise. Caveat emptor and all that.)
Saved by gossip! Epidemic gossip is a remedy for what ails Chord; cf. Epichord (Liskov) and Bamboo. Key insight: gossip-based DHTs, if correctly designed, are self-stabilizing!
Connection to self-stabilization. Self-stabilization theory: describe a system and a desired property; assume a failure in which the code remains correct but node states are corrupted; the proof obligation is to show that the property is reestablished within bounded time. But it doesn't bound the badness while a transient disruption is occurring.
Beyond self-stabilization. Tardos poses a related problem: consider the behavior of the system while an endless sequence of disruptive events occurs, so the system never reaches a quiescent state. Under what conditions will it still behave correctly? Results take the form "if the disruptions satisfy some stated condition, then the correctness property is continuously satisfied." Hypothesis: with convergent consistency we may be able to develop a proof framework for systems that are continuously safe.
Let's look at a second example. The Astrolabe system uses a different emergent data structure, a tree. Nodes are given an initial location: each knows its "leaf domain". Inner nodes are elected using gossip and aggregation.
Astrolabe. Intended as help for applications adrift in a sea of information. Structure emerges from a randomized gossip protocol. This approach is robust and scalable even under stress that cripples traditional systems. Developed at RNS, Cornell, by Robbert van Renesse, with many others helping. Today used extensively within Amazon.com.
Astrolabe is a flexible monitoring overlay. (Diagram: swift.cs.cornell.edu and cardinal.cs.cornell.edu each hold a table with columns Name, Time, Load, Weblogic?, SMTP?, Word Version and rows for swift, falcon, and cardinal; Astrolabe periodically pulls data from the monitored systems into these tables.)
Astrolabe in a single domain. Each node owns a single tuple, like a management information base (MIB). Nodes discover one another through a simple broadcast scheme ("anyone out there?") and gossip about membership. Nodes also keep replicas of one another's rows. Periodically, each node picks a peer uniformly at random and merges its state with that peer's.
State merge: the core of the Astrolabe epidemic. (Three-slide diagram sequence: swift.cs.cornell.edu and cardinal.cs.cornell.edu each hold a replica of the domain's table; they exchange their tables, and each keeps, for every row, the copy with the fresher timestamp, so after the exchange both replicas agree.)
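A minimal sketch of that merge step, with assumed field names (time, load): each replica is a table keyed by node name, and for every row the copy with the newer timestamp wins, so two gossiping nodes converge on the same table.

```python
# Sketch: merge two replicas of a domain table, row by row, newest timestamp wins.
def merge(view_a: dict, view_b: dict) -> dict:
    merged = dict(view_a)
    for name, row in view_b.items():
        if name not in merged or row["time"] > merged[name]["time"]:
            merged[name] = row
    return merged

swift    = {"swift":    {"time": 2011, "load": 2.0},
            "cardinal": {"time": 2201, "load": 3.5}}
cardinal = {"swift":    {"time": 2003, "load": 0.8},
            "cardinal": {"time": 2231, "load": 1.7}}

# Each node's own row is freshest at that node; the merge keeps both fresh rows.
print(merge(swift, cardinal))
```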
Observations. The merge protocol has constant cost: one message sent and one received (on average) per unit time. The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so. Information spreads in O(log N) time, but this assumes bounded region size: in Astrolabe, regions are limited to a modest number of rows.
Big systems… A big system could have many regions; it looks like a pile of spreadsheets. A node only replicates data from its neighbors within its own region.
Scaling up… and up… With a stack of domains, we don't want every system to "see" every domain; the cost would be huge. So instead, each system will see a summary. (Diagram: many per-node copies of the leaf table.)
Astrolabe builds a hierarchy using a P2P protocol that "assembles the puzzle" without any servers. An SQL query "summarizes" the data, and the dynamically changing query output is visible system-wide. (Diagram: leaf tables for San Francisco, with rows swift, falcon, cardinal, and New Jersey, with rows gazelle, zebra, gnu, each with columns Name, Load, Weblogic?, SMTP?, Word Version, roll up into an inner table with columns Name, Avg Load, WL contact, SMTP contact and rows SF, NJ, Paris.)
Large scale: "fake" regions. These are computed by queries that summarize a whole region as a single row, and gossiped in a read-only manner within a leaf region. But who runs the gossip? Each region elects "k" members to run gossip at the next level up; one can play with the selection criteria and with "k".
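The real system expresses the summary as an SQL aggregation query; the sketch below only illustrates the idea, using hypothetical column names borrowed from the example tables. A leaf region's rows collapse into one row carrying an average load and representative contacts.

```python
# Sketch: collapse a leaf region's rows into one aggregate row for the level above.
def summarize_region(rows):
    avg_load = sum(r["load"] for r in rows) / len(rows)
    wl_contact   = next((r["name"] for r in rows if r["weblogic"]), None)
    smtp_contact = next((r["name"] for r in rows if r["smtp"]), None)
    return {"avg_load": avg_load, "wl_contact": wl_contact, "smtp_contact": smtp_contact}

sf = [{"name": "swift",    "load": 2.0, "weblogic": False, "smtp": True},
      {"name": "falcon",   "load": 1.5, "weblogic": True,  "smtp": False},
      {"name": "cardinal", "load": 4.5, "weblogic": True,  "smtp": True}]

print(summarize_region(sf))   # one row representing the whole SF domain
```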
Hierarchy is virtual… data is replicated. (Diagram: a highlighted leaf node "sees" its neighbors and the domains on the path to the root; falcon runs the level-2 epidemic for San Francisco because it has the lowest load there, and gnu does the same for New Jersey. A second diagram shows that a node in the other leaf domain sees a different leaf table, but has a consistent view of the inner domain.)
Worst-case load? A small number of nodes end up participating in O(log_fanout(N)) epidemics; here the fanout is something like 50. In each epidemic, a message is sent and received roughly every 5 seconds. We limit message size, so even during periods of turbulence no message can become huge.
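A back-of-the-envelope check of that load claim, using made-up population sizes: with a fanout of about 50, the number of levels, and hence of epidemics an elected node can participate in, grows as the base-50 logarithm of N.

```python
# Sketch: levels in a fanout-50 hierarchy for a few illustrative system sizes.
import math

for n in (2_500, 125_000, 6_250_000):
    print(n, math.ceil(math.log(n, 50)))   # 2, 3, and 4 levels respectively
```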
Self-stabilization? Like Kelips, it seems that Astrolabe is convergently consistent and self-stabilizing, and would "ride out" a large class of possible failures. But Astrolabe would be disrupted by incorrect aggregation (Byzantine faults) or by correlated failure of all representatives of some portion of the tree.
Focus on emergent shape. Kelips: nodes start with an a priori assignment to affinity groups and end up with a superimposed pointer structure. Astrolabe: nodes start with a priori leaf domain assignments and build the tree. What other data structures can be constructed with emergent protocols?
Emergent shape. We know a lot about a related question: given a connected graph and a cost function, with nodes of bounded degree, use a gossip protocol to swap links until some desired graph emerges. Another related question: given a gossip overlay, improve it by selecting "better" links (usually, lower RTT). Example: the "Anthill" framework of Alberto Montresor, Ozalp Babaoglu, Hein Meling and Francesco Russo.
Example of an open problem. Given a description of a data structure (for example, a balanced tree), design a gossip protocol such that the system will rapidly converge towards that structure even if disrupted. Do it with bounded per-node message rates and sizes (network load is less important). Use aggregation to test tree quality?
Van Renesse's dreadful aggregation tree. (Diagram: leaves A through P, with inner levels run by A, C, E, G, I, K, M, O, then B, F, J, N, then D, L. An event e occurs at H; G gossips with H and learns e; but P learns of it only O(N) time units later!)
What went wrong? In Robbert's horrendous tree, each node has an equal amount of "work to do", but the information-space diameter is larger! Astrolabe benefits from "instant" knowledge because the epidemic at each level is run by someone elected from the level below.
Insight: two kinds of shape. We've focused on the aggregation tree, but in fact we should also think about the information flow tree.
Information-space perspective. Bad aggregation graph: diameter O(n); the information flow is effectively a chain H-G-E-F-B-A-C-D-L-K-I-J-N-M-O-P. Astrolabe version: diameter O(log n); leaves pair up (A-B, C-D, E-F, G-H, I-J, K-L, M-N, O-P) and roll up a balanced tree.
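A toy illustration of those two information-flow graphs; the graphs below are stand-ins, not actual protocol state. A 16-node chain has diameter 15, while a balanced binary tree over 16 leaves has diameter 2 * log2(16) = 8.

```python
# Sketch: compare the diameter of a 16-node chain with a complete binary tree
# over 16 leaves, using BFS from every node (exact for small graphs).
from collections import deque

def diameter(adj):
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(v) for v in adj)

n = 16
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

tree = {i: [] for i in range(1, 2 * n)}   # heap-indexed complete binary tree, 16 leaves
for i in range(2, 2 * n):
    tree[i].append(i // 2)
    tree[i // 2].append(i)

print(diameter(chain), diameter(tree))    # 15 versus 8
```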
Gossip and bias. It is often useful to "bias" gossip, particularly if some links are fast and others are very slow. (Diagram: two clusters of nodes joined by a single slow link; with unbiased peer selection, roughly half the gossip will cross that link!) Demers shows how to adjust the probabilities to even the load; Xiao later showed that the gossip rate must also be fine-tuned.
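A minimal sketch of one way to bias target selection; this is not Demers' actual scheme, and the probability p_remote is an assumption. Peers across the slow link are chosen rarely, so only a small fraction of gossip crosses it instead of roughly half.

```python
# Sketch: choose a gossip partner, crossing the slow link only with probability p_remote.
import random

def pick_target(local_peers, remote_peers, p_remote=0.05, rng=random):
    if remote_peers and rng.random() < p_remote:
        return rng.choice(remote_peers)
    return rng.choice(local_peers)

local  = ["A", "B", "C", "D"]
remote = ["X", "Y", "Z", "E", "F"]
picks = [pick_target(local, remote) for _ in range(10_000)]
print(sum(p in remote for p in picks) / len(picks))   # ~0.05 instead of ~0.5
```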
Gossip and bias. Idea: "shaped" gossip probabilities. Gravitational Gossip (Jenkins): groups carry multicast event streams reporting the evolution of a continuous function (like electric power loads). Some nodes want to watch "closely" while others only need part of the data.
Gravitational Gossip. (Diagram: concentric rings of nodes. Inner ring: nodes want 100% of the traffic; middle ring: 75%; outer ring: 20%.) Jenkins: when a gossips to b, a includes information about topic t in a way weighted by b's level of interest in topic t.
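A hedged sketch of that weighting idea; the topic names and interest values are invented for illustration. When a gossips to b, each pending update on topic t is included with probability equal to b's interest in t.

```python
# Sketch: filter a's pending updates by b's per-topic interest weights.
import random

def build_payload(pending_updates, interest, rng=random):
    return [u for (topic, u) in pending_updates
            if rng.random() < interest.get(topic, 0.0)]

interest_of_b = {"power-load": 0.75, "alarms": 1.0, "debug": 0.2}
pending = [("power-load", "load=71MW"),
           ("alarms", "breaker-3 open"),
           ("debug", "trace #8812")]
print(build_payload(pending, interest_of_b))
```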
How does bias impact the information-flow graph? Earlier, all links were the same; now, some links carry less information and may have longer delays. (Are bounded-capacity messages "similar" to long links?) Question: model biased information-flow graphs and explore the implications.
Questions about bias. When does biasing of gossip target selection break analytic results? Example: Alves and Hopcroft show that with too small a fanout, gossip epidemics can die out, logically partitioning a system. Question: can we relate this to flooding on an expander graph?
Can an overlay learn its shape? This is the unifying link across the many examples we've cited, and it is needed in real-world overlays when these adapt to the underlying topology, for example to pick "super-peers". Hints of a theory of adaptive tuning? Related to "sketches" in databases.
Can an overlay learn its shape? Suppose we wanted to use gossip to implement decentralized, balanced overlay trees (nodes "know" their locations). Build some sort of overlay, randomly select initial upper-level nodes, then jiggle until they are nicely spaced. This is an instance of the "facility location" problem.
Emergent landmarks. (Diagram sequence: an initial random selection leaves some clusters with too many landmarks and others with too few; the selection is adjusted until there is one landmark per cluster, near its centroid.)
2-D scenario: a gossip variant of Euclidean k-medians (a popular STOC topic). Nodes know their position on a line and can gossip; in log(N) time, seek k equally spaced nodes; later, after a disruption, reconverge in log(N) time. NB: the facility placement problem is just the d-dimensional k-centroids problem. (Diagram: K = 3 landmarks on a line.)
Why is this hard? After k rounds, each node has only communicated (directly or indirectly) with at most 2^k other nodes. In log(N) time, aggregate information has time to transit the whole graph, but with bounded bandwidth we don't have time to send all the data.
Emergent shape problem. Can a gossip system learn its own shape in a convergently consistent way? Start with centralized solutions drawn from the optimization community; then seek decentralized gossip variants that achieve convergently consistent approximations, (re)converge in log(N) rounds, and use bounded bandwidth.
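As a centralized starting point of the kind the slide suggests, here is a simple heuristic, not an exact k-medians solver, and all names in it are assumptions: pick the node nearest each of k evenly spaced target positions on the line. The open problem is then to approximate this with gossip, in log(N) rounds, under bounded bandwidth.

```python
# Sketch: a centralized baseline that picks k roughly equally spaced nodes on a line.
def equally_spaced_landmarks(positions, k):
    lo, hi = min(positions), max(positions)
    targets = [lo + (hi - lo) * (2 * i + 1) / (2 * k) for i in range(k)]
    return [min(positions, key=lambda p: abs(p - t)) for t in targets]

positions = [0.3, 1.1, 2.0, 2.2, 4.8, 5.0, 7.7, 9.6]
print(equally_spaced_landmarks(positions, 3))   # three roughly equally spaced nodes
```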
… more questions. Notice that Astrolabe forces participants to agree on what the aggregation hierarchy should contain; in effect, we need to "share interest" in the aggregation hierarchy. This allows us to bound the size of messages (expected constant) and the rate (expected constant per epidemic). (Optional content, time permitting.)
The question. Could we design a gossip-based system for "self-centered" state monitoring? Each node poses a query, Astrolabe-style, on the state of the system; we dynamically construct an overlay for each of these queries; the "system" is the union of these overlays.
Self-centered monitoring. (Diagram.)
Self-centered queries… Offhand, this looks like a bad idea: if everyone has an independent query, and everyone is i.i.d. in all the obvious ways, then everyone must invest work proportional to the number of nodes monitored by each query. If O(n) queries each touch O(n) nodes, per-node workload grows as O(n) and global load as O(n^2).
Aggregation. But in practice it seems unlikely that queries would look this way. More plausible is something Zipf-like: a few queries look at the broad state of the system, most look at relatively few nodes, and a small set of aggregates might be shared by the majority of queries. Assuming this is so, can one build a scalable gossip overlay / monitoring infrastructure?
Summary of possible topics: consistency models; necessary conditions for convergent consistency; self-stabilization; implications of bias; emergent structures; self-centered aggregation; bandwidth-limited systems and "sketches"; using aggregation to fine-tune a shape with constant costs.